daquang / YAMDA

Yet Another Motif Discovery Algorithm
MIT License

not removing motifs during erase phase in -e mode or in default masking #9

Closed · noamteyssier closed this issue 5 years ago

noamteyssier commented 5 years ago

To test this, I generated a dataset of exactly 3 motifs across 1e4 sequences, each motif equally probable. While YAMDA has been able to find all 3 motifs across independent runs with different random seeds, it is unable to find multiple motifs within the same run.

With the -n flag raised, it continually picks the same seed word (which, on the plus side, is exactly one of the motifs I generated).

Looking into this a bit, I've found that in _erase_seqs_containing_motifs and _erase_motif_occurences none of the sites coming out of the 2D convolution pass the threshold, even though the sequence is there and can be pulled out with a regex.
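
To make the pattern concrete, here's roughly what I understand the erase step to be doing, as a simplified sketch with made-up names and data (not YAMDA's actual code): score every window of the one-hot sequence against the motif matrix with a scipy 2D convolution, then mask the positions whose score clears a threshold.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# Hypothetical setup: a one-hot protein sequence (20 x L) and a motif matrix (20 x w)
L, w = 200, 8
seq_onehot = np.zeros((20, L))
seq_onehot[rng.integers(0, 20, L), np.arange(L)] = 1.0
pwm = rng.dirichlet(np.ones(20), size=w).T          # 20 x w probability matrix

# one score per length-w window of the sequence
scores = convolve2d(seq_onehot, pwm, mode="valid").ravel()

# hypothetical erase cutoff; windows scoring above it should get zeroed out
threshold = 0.5 * pwm.max(axis=0).sum()
hits = np.nonzero(scores > threshold)[0]            # the analogous hit list in YAMDA comes back empty for me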

daquang commented 5 years ago

Glad to hear the word seed search is finding the correct word. Can you try setting -f to 1.0?

noamteyssier commented 5 years ago

I tried a gradient of values from 1.0 down to 0.1 and only started to see some change in the motifs in the 0.7 to 1.0 range, but once again the motifs aren't being erased from the sequences, so the PPMs in the output are all incredibly similar.

The random seed search is changing, but the output PPM is not.

daquang commented 5 years ago

Can you send me the FASTA file? I have a feeling the initialization might have to be slightly adjusted for protein motifs.

Also, how many motif instances are in the FASTA file?

noamteyssier commented 5 years ago

motifInfo.txt motif_3_protein.fa.gz

There should be three motifs. I've attached a txt file that shows the occurrences of each motif; this information is also available in the FASTA headers.

daquang commented 5 years ago

Thanks for informing me about this. I realized the bug was due to using convolve2d instead of correlate2d from the scipy.signal package. I've updated the GitHub repository and used the following command line to get an output that is more or less what you want:

python run_em.py -i motif_3_protein.fa -j motif_3_protein_shuffled.fa -oc Outputs/Protein -e -w 8 -n 3 -a protein -f 1.0

MEME version 4

ALPHABET= ACDEFGHIKLMNPQRSTVWY

strands: +

Background letter frequencies (from uniform background):
A 0.050000 C 0.050000 D 0.050000 E 0.050000 F 0.050000 G 0.050000 H 0.050000 I 0.050000 K 0.050000 L 0.050000 M 0.050000 N 0.050000 P 0.050000 Q 0.050000 R 0.050000 S 0.050000 T 0.050000 V 0.050000 W 0.050000 Y 0.050000

MOTIF M0 N0

letter-probability matrix: alength= 20 w= 8 nsites= 299 E= 0 0.073333 0.043333 0.073333 0.063333 0.050000 0.036667 0.053333 0.036667 0.063333 0.053333 0.033333 0.030000 0.033333 0.050000 0.066667 0.040000 0.030000 0.073333 0.036667 0.060000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

MOTIF M1 N1

letter-probability matrix: alength= 20 w= 8 nsites= 259 E= 0 0.057692 0.050000 0.046154 0.023077 0.038462 0.065385 0.057692 0.084615 0.053846 0.038462 0.053846 0.050000 0.069231 0.042308 0.050000 0.042308 0.046154 0.034615 0.057692 0.038462 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

MOTIF M2 N2

letter-probability matrix: alength= 20 w= 8 nsites= 235 E= 0 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.046808 0.068085 0.025532 0.059574 0.059574 0.051064 0.046809 0.029787 0.034043 0.042553 0.042553 0.034043 0.055319 0.068085 0.051064 0.038298 0.042553 0.110638 0.051064 0.042553
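
For reference, the root of the problem: convolve2d flips the filter in both axes before sliding it, while correlate2d slides it as written, so scanning with convolve2d effectively searches for a reversed motif and the erase threshold is never reached at the true occurrences. A minimal illustration with made-up data (not the actual YAMDA code):

import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
onehot = np.eye(20)[:, rng.integers(0, 20, 50)]   # 20 x 50 one-hot sequence
pwm = rng.dirichlet(np.ones(20), size=8).T        # 20 x 8 motif matrix

corr = correlate2d(onehot, pwm, mode="valid")     # scores the motif as written
conv = convolve2d(onehot, pwm, mode="valid")      # scores the motif flipped in both axes

# the two generally disagree, which is why thresholding the convolve2d
# output missed the real motif occurrences
print(np.allclose(corr, conv))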

noamteyssier commented 5 years ago

Excellent! Thanks for working with me on this.

Tested it on a larger dataset and it's giving the results I'd expect.

For some reason I'm still needing to convert the threshold calculations to CPU on lines 320 and 343, and the conv_signal (correlation signal now) to a tensor on lines 332 and 355, but those are quick fixes on my end.
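
Roughly the kind of tweak I mean, as an illustrative sketch with hypothetical names (not the actual variables in the repo): move the GPU-resident threshold to the CPU before comparing it with scipy's numpy output, and wrap the correlation signal back into a torch tensor for the downstream masking code.

import torch
from scipy.signal import correlate2d

def erase_hits(seq_onehot_np, pwm_np, threshold_gpu, device="cuda"):
    # move the threshold off the GPU before mixing it with numpy results
    threshold = threshold_gpu.cpu().item()

    # scipy returns a numpy array; the downstream masking code expects a tensor
    corr_signal = correlate2d(seq_onehot_np, pwm_np, mode="valid")
    corr_tensor = torch.from_numpy(corr_signal).to(device)

    return corr_tensor > threshold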

Cheers!

daquang commented 5 years ago

Thanks for helping me figure out the issue! Apparently pytorch keeps changing small things between versions. I've updated YAMDA for pytorch 1.0.