daquang / YAMDA

Yet Another Motif Discovery Algorithm
MIT License
51 stars 7 forks source link

YAMDA unable to find CTCF motif in toy dataset that works with HOMER #6

Closed AvantiShri closed 5 years ago

AvantiShri commented 6 years ago

Hello @daquang,

Thank you for making your code available. We've been experimenting with YAMDA and have been running into some difficulties on relatively straightforward datasets like CTCF. To demonstrate the issues, we prepared a small dataset where the positive set had 5000 regions of length 400bp drawn from IDR optimal CTCF peaks from ENCODE, and the negative set had 5000 regions of length 400bp drawn randomly from the genome. The top discovered HOMER motif on that dataset is shown below, and a Colab notebook running HOMER on the dataset is here:

Homer motif 1:

screen shot 2018-08-05 at 3 34 02 pm

By contrast, when I run YAMDA on the same dataset (both with or without the -e option, and with and without the negative set), I get highly degenerate motifs as shown below (they are the only entries in the motifs.txt output file). A Colab notebook replicating the results is here:

With the -e option and a negative set (python YAMDA/run_em.py -e -r -i central400_pos_seqs_concat.fa -j central400_neg_seqs_concat.fa -oc yamda_test_e_pos_neg):

screen shot 2018-08-05 at 3 38 15 pm

Without the -e option and a negative set (python YAMDA/run_em.py -r -i central400_pos_seqs_concat.fa -j central400_neg_seqs_concat.fa -oc yamda_test_no_e_pos_neg):

screen shot 2018-08-05 at 3 37 54 pm

With the -e option and no negative set (python YAMDA/run_em.py -e -r -i pos_seqs_concat.fa -oc yamda_test_e_posonly):

screen shot 2018-08-05 at 3 40 24 pm

Without the -e option and no negative set (python YAMDA/run_em.py -r -i pos_seqs_concat.fa -oc yamda_test_no_e_posonly):

screen shot 2018-08-05 at 3 41 20 pm

Advice on what to do would be appreciated.

Also tagging @jsu27 and @akundaje.

daquang commented 6 years ago

I'm noticing that there are lower case sequences in your FASTA file. I recommend masking lower case letters with capital N's. I also suggest using a dinucleotide shuffled control instead of a random genome control. It's natural to try a random genome control, but I found in practice it does not work well because the binding sites tend to occur in C/G rich regions (e.g. promoters, cpg islands), which will bias motif discovery towards C/G repeats.

AvantiShri commented 6 years ago

Ok, we can try that. When the user does not provide a negative set, what kind of background is assumed?

AvantiShri commented 6 years ago

Also, can you elaborate on what the -e option means? The documentation just says Erase an entire sequence if it contains a discovered motif (default: erase individual motif occurrences). - would you recommend keeping in the -e argument for these datasets?

AvantiShri commented 6 years ago

One more question: why was there only one entry in motifs.txt? Is there some option that can be used to induce YAMDA to search for more than one motif? One would hope that if YAMDA can search for multiple motifs, it would be able to handle the case where the positive and negative sets have different C/G contents.

AvantiShri commented 6 years ago

Update: ok, I see there's an NMOTIFS option that defaults to 1. We can see what happens if that is varied. Do you have a recommended value to set it to? Also, I'm guessing that this interacts with the -e argument, in that a sequence is removed from consideration if a motif hit is found. So if searching for multiple motifs, I assume you'd suggest we leave out -e?

daquang commented 6 years ago

If you think each whole sequence has at most one motif occurrence, use -e. Otherwise, leave it out. Either way, it probably makes little difference and only matters if you intend on discovering multiple motifs. Notice how the output folder has positive and negative FASTA files. After discovering one motif, motif occurrences will be deleted from positive and negative sequences, and upon completion of the program the positive and negative sequences (with motif occurrences deleted) will be outputted. If you use -e, then the whole sequence will be removed if it has even one motif occurrence, otherwise only the individual motif occurrence gets deleted (e.g. ACCCGTGTC becomes ACNNNNGTC).

I usually set NMOTIFS to 1 because most users just use it on ChIP-seq data, which typically only have one major motif anyways. I have been able to grab multiple motifs in DGF data before. Perhaps I should put an example in the README demonstrating how to do this.

If a negative set is not used, then initial probability matrices will be generated by random probabilities, instead of derived from enriched k-mers. I do not recommend this at all. Seeding, and motif discovery in general, is much more of an art than a science. It is an attempt to make a simple story out of complicated data, which is why I typically prefer complex "black box" models these days. Nevertheless, the market demands simple interpretable models, and the market will get simple interpretable models. I was thinking at one point that I may adapt THiCweed's clustering method (https://academic.oup.com/nar/article/46/5/e29/4754463) to generate a set of strong "seed" PWMs, and then use EM to refine these seeds. I have a feeling the two algorithms will make a powerful combination.

As you can probably tell with the original MEME algorithm, MEME has a TON of heuristics. Clearly, with over twenty years of experience, they managed to collect a set of heuristics that works for a lot of users' test cases, although of course MEME will have issues for some datasets too. At this point, MEME has been modified so many times, it's hard to reproduce all of their heuristics, although I tried my best to reproduce the ones that lend themselves to mini batch training. For example, MEME has a -wnsites argument that I cannot find any information on from any of their publications, but the default parameter they chose ends up being very effective for most of the datasets they come across. I tried to implement my own version (facetiously called -fudge), but I doubt it is a perfect copy.