Making Random Sequences

iamciera commented 5 years ago

One of the goals of the project, once we have a model working well enough, is to feed "random sequences" into the model and have it predict function. This serves two purposes: 1. it gives us sequences to functionally test using the MS2 system and confocal microscopy, and 2. it allows us to ask questions about what the model is identifying as important features. For example, if TFBS spatial order is an important feature, we could systematically scramble the TFBSs in a sequence and see how the model's prediction is affected.
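The scrambling idea could be sketched roughly as follows. This is only an illustration, not code from the project: the function name `scramble_tfbs_order` and the `(start, end)` site coordinates are assumptions, and it presumes the TFBS positions are already known (e.g. from a motif scan). The spacer sequence between sites is left untouched; only the order of the sites changes.

```python
import random

def scramble_tfbs_order(sequence, sites, rng=None):
    """Shuffle which binding site sits in which slot, keeping spacers fixed.

    `sites` is a list of non-overlapping (start, end) half-open intervals.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    sites = sorted(sites)
    # Extract the site sequences, then shuffle their order.
    shuffled = [sequence[s:e] for s, e in sites]
    rng.shuffle(shuffled)
    pieces = []
    prev = 0
    for (start, end), new_site in zip(sites, shuffled):
        pieces.append(sequence[prev:start])  # spacer before this slot
        pieces.append(new_site)              # site now placed in this slot
        prev = end
    pieces.append(sequence[prev:])           # trailing spacer
    return "".join(pieces)
```

If the sites differ in length, downstream coordinates shift after scrambling, so positions would need to be recomputed before comparing model outputs.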

Some ideas on how to start

Part 1

Read about what has been tried. Please document thoughts below in this issue. Include other papers and results that you find.

A quick search yielded:

Garlic: Software program on GitHub. Can we just use this? Do we need to build something?
Realistic artificial DNA sequences as negative controls for computational genomics

Possible way to start building a simple program

These need to be in .fasta format.
Length should be a controlling factor.

Create a program that generates completely random sequences. It would be useful if GC content could be controlled with an argument.
Create a program that generates random sequences with shuffled TFBSs interspersed with "random" sequence. The arguments can be the number of TFBSs in the sequence and possibly the type of TFBS. This would build on the first part.
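As a starting point for the first program, here is a minimal sketch assuming only the Python standard library. The names `random_sequence` and `to_fasta`, the `random_seq` header prefix, and the 60-character line width are all assumptions made for illustration. GC content is treated as an expected fraction, so any individual sequence will vary around the target.

```python
import random

def random_sequence(length, gc_content=0.5, rng=None):
    """Return a random DNA string whose expected GC fraction is `gc_content`."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    rng = rng or random.Random()
    # G and C split gc_content equally; A and T split the remainder.
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

def to_fasta(sequences, prefix="random_seq", width=60):
    """Format sequences as FASTA text with numbered headers and wrapped lines."""
    records = []
    for i, seq in enumerate(sequences, start=1):
        lines = [seq[j:j + width] for j in range(0, len(seq), width)]
        records.append(">{}_{}\n{}".format(prefix, i, "\n".join(lines)))
    return "\n".join(records) + "\n"
```

Usage might look like `to_fasta(random_sequence(1000, gc_content=0.4) for _ in range(10))`, with the result written to a `.fasta` file.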

thethomaslane commented 5 years ago

I made a program that generates a random sequence with motifs thrown in at a certain frequency

https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/Random_Sequence_Generator.ipynb

thethomaslane commented 5 years ago

This program generates a random sequence with motifs added at certain frequencies that can be picked for each motif individually. GC content can be picked as well. Sequence in between motifs is neutral (doesn't contain motifs) and is generated by siteout.

https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/Neutral_Sequence/Neutral_Sequence_Generator.ipynb

iamciera commented 5 years ago

Awesome!! This looks great. Let's discuss it tomorrow, I think we can start generating (and testing) the sequences soon!

On Wed, Feb 13, 2019, 6:23 PM Thomas Lane <notifications@github.com> wrote:

I made a program that generates a random sequence with motifs thrown in at a certain frequency

https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/Random_Sequence_Generator.ipynb


iamciera commented 5 years ago

Great work @thethomaslane!

The number one thing I want to do with this is put it into its own repository. Can you move all this to a new repo called "neutral_sequence_generator"? Then we can discuss further development there.

I also talked with a Python developer here at BIDs and they think that the Siteout program can easily be converted to Python 3. I might have this done.

A few comments/questions while they are on my mind.

When creating a list of motifs whose proportions are determined by the probabilities and length, e.g. motif_dist(motifs_test,[.001,.003,.01,.007], 1000), is this essentially saying that those TFBSs need to appear at those four frequencies, scaled by length? So 1% of a 1000 bp sequence will consist of the third motif? When I tried it: 10 bp length * 12 (how many times I saw it in the list) = 120 bp, but that would be ~10% of a 1,000 bp sequence.
Can you explain Sequences.txt and how it is used in siteout.py?
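To make the question about the frequencies concrete, the two possible readings can be worked through with the numbers above (probability 0.01, length 1000, a 10 bp motif). This is just arithmetic, not a description of how `motif_dist` actually works; the two function names are hypothetical.

```python
def per_position_reading(prob, seq_length, motif_length):
    """Reading A: `prob` is the chance that a motif copy starts at each position.

    Returns (expected copies, fraction of the sequence covered by the motif).
    """
    copies = prob * seq_length
    return copies, copies * motif_length / seq_length

def fraction_of_bases_reading(prob, seq_length, motif_length):
    """Reading B: `prob` is the fraction of bases that belong to the motif.

    Returns (expected copies, fraction of the sequence covered by the motif).
    """
    covered_bases = prob * seq_length
    return covered_bases / motif_length, prob

copies_a, frac_a = per_position_reading(0.01, 1000, 10)
copies_b, frac_b = fraction_of_bases_reading(0.01, 1000, 10)
print("Reading A: {:.1f} copies, {:.0%} of the sequence".format(copies_a, frac_a))
print("Reading B: {:.1f} copies, {:.0%} of the sequence".format(copies_b, frac_b))
```

Under reading A about 10 copies are expected, covering roughly 10% of the sequence, which is close to the 12 copies observed. Under reading B only about one copy would be expected.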

thethomaslane commented 5 years ago

I fixed the probability issue and explained the Sequences.txt

I created a new repo. I still need to fix some things, like the readme and usability issues, before I'm done with this.

https://github.com/thethomaslane/neutral_sequence_generator

iamciera commented 5 years ago

Hey @thethomaslane,

Let's go ahead and make random sequences to test. To start, let's do two types.
- Create 500 completely neutral, no TFBS sequences.
- Create 500 with TFBS in equal probability. How about .25 each PWM (MA0447.1.pfm, MA0216.2.pfm, MA0212.1.pfm, MA0049.1.pfm).
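One way to read "equal probability" is that each inserted site is drawn from one of the matrices chosen uniformly at random, and the site itself is sampled column by column from that matrix's counts. The sketch below illustrates that reading only. It is not the generator's actual code, and `TOY_PFMS` is made-up placeholder data standing in for the four real JASPAR matrices.

```python
import random

# Made-up toy count matrices (base -> counts per position); NOT the real JASPAR data.
TOY_PFMS = {
    "toy_motif_1": {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]},
    "toy_motif_2": {"A": [5, 1, 0], "C": [0, 1, 0], "G": [0, 1, 0], "T": [5, 7, 10]},
}

def sample_site(pfm, rng):
    """Draw one binding site, sampling each position from its column counts."""
    site = []
    for i in range(len(pfm["A"])):
        weights = [pfm[base][i] for base in "ACGT"]
        site.append(rng.choices("ACGT", weights=weights, k=1)[0])
    return "".join(site)

def pick_motif(pfms, rng):
    """Choose a matrix uniformly (equal probability each) and sample a site from it."""
    name = rng.choice(sorted(pfms))
    return name, sample_site(pfms[name], rng)
```

With four matrices loaded in place of the toy data, `pick_motif` would select each one with probability .25.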

You can put the sequences in the Google drive and I will calculate the scores.

We need to go through the code together and start making some decisions about how to format and make into more of a universal tool.
Please get rid of all old code related to neutral_sequence_generator in this team_neural_network directory.

thethomaslane commented 5 years ago

I added the folder control_seqs to the Google drive.
