Closed noamteyssier closed 5 years ago
Glad to hear the word seed search is finding the correct word. Can you try setting -f to 1.0?
Tried a gradient from 1.0 to 0.1, only started to see some change in motifs in the 0.7 to 1.0 range, but once again the motifs aren't being erased from the sequences so the PPMs are all incredibly similar in the output.
The random seed search is changing, but the output ppm is not.
Can you send me the FASTA file? I have a feeling the initialization might have to be slightly adjusted for protein motifs.
Also, how many motif instances are in the FASTA file?
motifInfo.txt motif_3_protein.fa.gz
Should be three motifs, I attached a txt file that shows the occurrences of each motif This is also available in the fasta headers
Thanks for informing me about this. I realized the bug was due to using convolve2d instead of correlate2d from the scipy signal package. I updated the GitHub repository and used the following command line to get an output that seems more or less what you want:
python run_em.py -i motif_3_protein.fa -j motif_3_protein_shuffled.fa -oc Outputs/Protein -e -w 8 -n 3 -a protein -f 1.0
MEME version 4
ALPHABET= ACDEFGHIKLMNPQRSTVWY
strands: +
Background letter frequencies (from uniform background): A 0.050000 C 0.050000 D 0.050000 E 0.050000 F 0.050000 G 0.050000 H 0.050000 I 0.050000 K 0.050000 L 0.050000 M 0.050000 N 0.050000 P 0.050000 Q 0.050000 R 0.050000 S 0.050000 T 0.050000 V 0.050000 W 0.050000 Y 0.050000
MOTIF M0 N0
letter-probability matrix: alength= 20 w= 8 nsites= 299 E= 0 0.073333 0.043333 0.073333 0.063333 0.050000 0.036667 0.053333 0.036667 0.063333 0.053333 0.033333 0.030000 0.033333 0.050000 0.066667 0.040000 0.030000 0.073333 0.036667 0.060000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
MOTIF M1 N1
letter-probability matrix: alength= 20 w= 8 nsites= 259 E= 0 0.057692 0.050000 0.046154 0.023077 0.038462 0.065385 0.057692 0.084615 0.053846 0.038462 0.053846 0.050000 0.069231 0.042308 0.050000 0.042308 0.046154 0.034615 0.057692 0.038462 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
MOTIF M2 N2
letter-probability matrix: alength= 20 w= 8 nsites= 235 E= 0 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.046808 0.068085 0.025532 0.059574 0.059574 0.051064 0.046809 0.029787 0.034043 0.042553 0.042553 0.034043 0.055319 0.068085 0.051064 0.038298 0.042553 0.110638 0.051064 0.042553
Excellent! Thanks for working with me on this.
Tested it on a larger dataset and it's giving the results I'd expect.
For some reason still needing to convert threshold calculations to cpu on lines 320 and 343 and the conv_signal (correlation signal now) to a tensor in 332 and 355. But quick fixes on my end
Cheers!
Thanks for helping me figure out the issue! Apparently pytorch keeps changing small things between versions. I've updated YAMDA for pytorch 1.0.
I've generated a dataset of exactly 3 motifs in 1e4 sequences, each equally as probable as each other to test this, and while YAMDA has been able to find all 3 motifs from independent runs with random seeds given, it's unable to find multiple motifs within the same run.
With the -n flag raised it continually uses the same seed (which on the plus side is exactly one of the motifs I generated).
Looking into this a bit I've found that in _erase_seqs_containing_motifs and _erase_motif_occurences none of the sites past the 2dconvolution are passing the threshold, even though the sequence is there and able to be puled with regex.