Open iamciera opened 5 years ago
I made a program that generates a random sequence with motifs thrown in at a certain frequency
This program generates a random sequence with motifs added at certain frequencies that can be picked for each motif individually. GC content can be picked as well. Sequence in between motifs is neutral (doesn't contain motifs) and is generated by siteout.
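For anyone following the thread, the idea described above can be sketched roughly as follows. This is not the actual program: the names `random_background` and `insert_motifs` are hypothetical, each frequency is assumed to be the per-position chance that a motif starts there, and the background here is plain random sequence, so unlike SiteOut output it is not guaranteed to be free of motifs.

```python
import random


def random_background(length, gc_content, rng):
    """Return random DNA whose expected GC fraction is gc_content."""
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    # Weights are listed in the same order as the letters A, C, G, T.
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def insert_motifs(length, motif_freqs, gc_content=0.5, seed=None):
    """Build a `length` bp sequence with motifs dropped in at random.

    motif_freqs maps motif string -> assumed per-position start probability.
    """
    rng = random.Random(seed)
    motifs = list(motif_freqs)
    freqs = [motif_freqs[m] for m in motifs]
    p_none = 1 - sum(freqs)
    if p_none < 0:
        raise ValueError("motif frequencies sum to more than 1")

    pieces, n = [], 0
    while n < length:
        # Pick one of the motifs, or None for a single background base.
        choice = rng.choices(motifs + [None], weights=freqs + [p_none])[0]
        if choice is not None and n + len(choice) <= length:
            pieces.append(choice)
            n += len(choice)
        else:
            pieces.append(random_background(1, gc_content, rng))
            n += 1
    return "".join(pieces)


seq = insert_motifs(1000, {"TTGACA": 0.01, "GGATCC": 0.005},
                    gc_content=0.4, seed=1)
```

Passing a `seed` makes the output reproducible, which helps when comparing control sequences across runs.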
Awesome!! This looks great. Let's discuss it tomorrow, I think we can start generating (and testing) the sequences soon!
On Wed, Feb 13, 2019, 6:23 PM Thomas Lane <notifications@github.com> wrote:
> I made a program that generates a random sequence with motifs thrown in at a certain frequency
Great work @thethomaslane!
The number one thing I want to do with this is put it into its own repository. Can you move all this to a new repo called "neutral_sequence_generator"? Then we can discuss further development there.
I also talked with a Python developer here at BIDs and they think that the Siteout program can easily be converted to Python 3. I might have this done.
A few comments/questions while they are on my mind.
1. `motif_dist(motifs_test,[.001,.003,.01,.007], 1000)`: is this essentially just saying that you need those TFBSs to appear at those four frequencies, depending on length? So 1% of a 1000 bp sequence will consist of the third motif? When I tried: 10 bp length * 12 (how many times I saw it in the list) = 130 bp ... but that would be ~10% of a 1,000 bp sequence.
2. Can you explain Sequences.txt and how it is used in siteout.py?

I fixed the probability issue and explained the Sequences.txt
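For reference, the frequency question above comes down to which of two readings of "frequency" is meant. A small sketch of the arithmetic, using hypothetical helper names and making no claim about which reading `motif_dist` used before or after the fix:

```python
def expected_coverage(freq, motif_len, seq_len):
    """Reading A: freq is the chance that a motif starts at any given position.

    Returns (expected occurrences, bp covered, fraction of sequence covered).
    """
    expected_count = freq * seq_len
    covered_bp = expected_count * motif_len
    return expected_count, covered_bp, covered_bp / seq_len


def count_for_fraction(freq, motif_len, seq_len):
    """Reading B: freq is the fraction of bases that belong to the motif.

    Returns how many occurrences are needed to cover that fraction.
    """
    return freq * seq_len / motif_len


# Third motif from the call above: frequency 0.01, assumed 10 bp long, 1000 bp sequence.
reading_a = expected_coverage(0.01, 10, 1000)
reading_b = count_for_fraction(0.01, 10, 1000)
```

Under reading A, a frequency of 0.01 gives about 10 occurrences of a 10 bp motif, so roughly 100 bp or 10% of the sequence, which is close to what was observed. Under reading B, the same frequency means a single occurrence.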
I created a new repo. I need to fix some things, like the README and usability issues, before I'm done with this.
Hey @thethomaslane,
You can put the sequences in the Google drive and I will calculate the scores.
We need to go through the code together and start making some decisions about how to format it and make it into more of a universal tool.
Please get rid of all old code related to neutral_sequence_generator in this team_neural_network directory.
I added the folder control_seqs to the Google drive.
One of the goals of the project, once we get a model working well enough, is to be able to feed "random sequences" into the model so that it can predict their function. This serves two purposes: 1. it gives us sequences to try to test functionally using the MS2 system and confocal microscopy, and 2. it allows us to ask questions about what the model is identifying as important features. For example, if TFBS spatial order is an important feature, we could systematically scramble the TFBSs in a sequence and see how the model is affected.
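As a rough illustration of the scrambling idea, here is a minimal sketch that shuffles which TFBS sits in which slot while leaving the background between sites untouched. The function name and the `(start, end)` site format are assumptions made for this example, not part of any existing code in the project.

```python
import random


def scramble_site_order(seq, sites, seed=None):
    """Shuffle the order of annotated sites, keeping the background in place.

    sites: list of (start, end) half-open, non-overlapping intervals.
    """
    rng = random.Random(seed)
    sites = sorted(sites)
    motifs = [seq[start:end] for start, end in sites]
    rng.shuffle(motifs)

    pieces, prev = [], 0
    for (start, end), motif in zip(sites, motifs):
        if start < prev:
            raise ValueError("sites overlap")
        pieces.append(seq[prev:start])  # background before this slot
        pieces.append(motif)            # motif reassigned to this slot
        prev = end
    pieces.append(seq[prev:])           # trailing background
    return "".join(pieces)


original = "AAAA" + "TTGACA" + "CCCC" + "GGATCC" + "AAAA" + "TATAAT" + "CC"
site_list = [(4, 10), (14, 20), (24, 30)]
scrambled = scramble_site_order(original, site_list, seed=3)
```

Feeding both `original` and `scrambled` to the model and comparing the predictions would be one way to probe whether site order matters.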
Some ideas on how to start
Part 1
Read about what has been tried. Please document thoughts below in this issue. Include other papers and results that you find.
A quick search yielded:
Possible way to start building a simple program
.fasta format.
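For the FASTA output step, a minimal sketch in plain Python could look like this. The function name `to_fasta`, the record names, and the 60-character line width are all assumptions chosen for illustration.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Split the sequence into chunks of at most `width` characters.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


fasta_text = to_fasta([("control_1", "ACGT" * 20)])
```

The resulting string can be written to disk with an ordinary `open(path, "w")` call.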