biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

Integration of Quikr, a new classifier #1093

Open ghost opened 11 years ago

ghost commented 11 years ago

Hello,

We are looking to contribute a new classifier to the Qiime project. We recently released Quikr (https://github.com/EESI/quikr, http://bioinformatics.oxfordjournals.org/content/29/17/2096), and think it would be good to integrate it into Qiime to offer users a faster solution for OTU classification. Quikr currently runs faster than the default classifier, RDP.

Quikr can train with any database using our tools, and it's trivial to generate quikr_train matrices for all Greengenes rep_set files (currently 14 files).

Currently our command line tool can generate an OTU table for multiple files, which can be converted to BIOM format for use with Qiime (https://github.com/EESI/quikr/blob/master/doc/cli.markdown). We do not get an individual sequence labeling like assign_taxonomy.py, but our quick generation of an OTU table would greatly benefit users. Instead of classifying individual sequences, we classify each fasta file as a whole and return the estimated counts of each OTU in that sample file.

What would be the best way to integrate this into the Qiime workflow? Would adding another class to assign_taxonomy.py be the best route?

Thank you, Calvin Morrison EESI Lab

antgonza commented 11 years ago

It sounds really cool.

The idea of PyCogent/QIIME is to wrap the original tools and have one single interface for all methods, where the interface is a script whose inputs and outputs are always the same and which internally formats them to interact with the tool in question. Anyway, I think the best place to add it is in assign_taxonomy.py as a new --assignment_method. The explanation of how to add a new PyCogent controller can be found here: http://pycogent.org/examples/building_and_using_an_application_controller.html. Note that you can add the new controller to qiime/pycogent_backports for the time being and then we can transfer it to PyCogent.

Let us know if you have any more questions.

ghost commented 11 years ago

Hi,

Thanks for the information, I've started looking into writing my own controller.

I think that would be a good way to integrate it. My only worry is that we do not offer sequence-by-sequence classification, but rather report the estimated OTUs present in the entire sample. Would that output be acceptable? Would it fit better somewhere else?

antgonza commented 11 years ago

Got it. Will it be possible to return the sequence-by-sequence classification? Often, for different tests, we want to remove groups of sequences (OTUs) that belong to a certain taxonomic group and rerun those tests. Or is there another way to do this within Quikr?

ghost commented 11 years ago

The way Quikr works, it won't be possible to return sequence-by-sequence classification. Quikr will give a representation of the whole file, but won't tell you the OTUs of the individual sequences.

Hope that helps clarify

antgonza commented 11 years ago

Understood. The problem is how to filter the OTU table based on taxonomy, any ideas?

One more question, if you use Quikr on two sets of samples separately, two different runs, is there a way to merge/compare the results? Or do you need to rerun Quikr on the merged OTU table? Sorry if this info is in the tutorial/documentation.

ghost commented 11 years ago

Our software does not have that capability, but I can't imagine it would be difficult to implement. I think that merging two OTU tables could probably be accomplished with a small Python script.
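To make the "small Python script" idea concrete, here is a minimal sketch, not part of Quikr or QIIME. It assumes each OTU table is held in memory as a nested dict of the form {sample_id: {otu_id: count}} and that both runs used the same reference database, so that identical OTU ids mean the same reference sequence. The function name merge_otu_tables and the sample and OTU ids are hypothetical.

```python
from collections import defaultdict


def merge_otu_tables(table_a, table_b):
    """Merge two OTU tables given as {sample_id: {otu_id: count}}.

    Assumption: both tables were produced against the same reference
    database, so equal OTU ids refer to the same reference sequence.
    Counts for a sample that appears in both tables are summed.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for table in (table_a, table_b):
        for sample_id, counts in table.items():
            for otu_id, count in counts.items():
                merged[sample_id][otu_id] += count

    # Fill in zeros so that every sample has an entry for every OTU.
    all_otus = sorted({otu for counts in merged.values() for otu in counts})
    return {
        sample_id: {otu: merged[sample_id].get(otu, 0.0) for otu in all_otus}
        for sample_id in sorted(merged)
    }


# Hypothetical results of two separate runs.
run_1 = {"sampleA": {"otu_1": 10, "otu_2": 5}}
run_2 = {"sampleB": {"otu_2": 7, "otu_3": 1}}
merged = merge_otu_tables(run_1, run_2)
```

The zero-filling step matters because downstream tools generally expect a rectangular table, with every OTU present for every sample.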

ghost commented 11 years ago

> Understood. The problem is how to filter the OTU table based on taxonomy, any ideas?

I'm not sure I understand the question. Could you rephrase it? Thank you.

rob-knight commented 11 years ago

Why do you want to do this in Quikr rather than filtering the OTU table by taxonomy as a step before input into Quikr?

On Aug 22, 2013, at 9:08 AM, Antonio Gonzalez wrote:

Understood. The problem is how to filter the OTU table based on taxonomy, any ideas?

One more question, if you use Quikr on two sets of samples separately, two different runs, is there a way to merge/compare the results? Or do you need to rerun Quikr on the merged OTU table? Sorry if this info is in the tutorial/documentation.


rob-knight commented 11 years ago

My point is that OTU table manipulation should probably be done outside the classifier methods rather than being included in them, right? And this functionality already exists in qiime.

On Aug 22, 2013, at 9:19 AM, Calvin Morrison wrote:

Our software does not have that capability, but I can't imagine it would be difficult to implement. I think that merging two OTU tables could probably be accomplished with a small Python script.


antgonza commented 11 years ago

Because Quikr doesn't return the taxonomy per sequence but for the full sample, maybe the best solution is to add this functionality to Quikr?

Questions in the forum asking about this functionality: https://www.google.com/search?ie=UTF-8&oe=UTF-8&q=filter+by+taxonomy&btnG=Search&sitesearch=groups.google.com%2Fgroup%2Fqiime-forum

rob-knight commented 11 years ago

Oh, I see the issue. Yes, that would be necessary to compare with the results of other classifiers. In its present form, this would be more like the source tracking output, right?

On Aug 22, 2013, at 10:24 AM, Antonio Gonzalez wrote:

Because Quikr doesn't return the taxonomy per sequence but for the full sample, maybe the best solution is to add this functionality to Quikr?

Questions in the forum asking about this functionality: https://www.google.com/search?ie=UTF-8&oe=UTF-8&q=filter+by+taxonomy&btnG=Search&sitesearch=groups.google.com%2Fgroup%2Fqiime-forum


antgonza commented 11 years ago

Not sure, I think it is more similar to the output of summarize_taxa.py. @mutantturkey can you comment?

gailrosen commented 11 years ago

The trade-off for speed in Quikr is that it analyzes a whole sample and outputs the percentage of that sample that belongs to each OTU (which we can then transform into the OTU table). This can speed up some forms of analysis: PCoA, for example, relies only on an OTU table and not on sequence-by-sequence classification. So, if one can analyze the 200+ HMP samples and generate a PCoA in a few hours, which we show in our paper, there is an advantage to that. However, if you want sequence-by-sequence classification, we would suggest another method, which will likely take longer. (But that is the trade-off.)

So, Quikr can be of use for, and speed up, some downstream comparative analyses. It would be trivial to merge OTU tables that were analyzed with the same database. Our input is raw fasta files, which we classify against a database like Greengenes. However, our output just gives you the fraction of each database sequence found in the sample, and since the whole sample is solved all at once and not sequence by sequence, we cannot get a sequence-by-sequence classification. I guess our output could be used as an input to summarize_taxa.py.

Perhaps we can implement a quikr_multisample_to_otutable that some users may want to use to generate some comparative analyses with, rather than in assign_taxonomy.py.
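As a rough illustration of what such a conversion step might involve, the sketch below scales per-sample abundance fractions to estimated read counts. It is an assumption-laden example, not Quikr's actual code: the function names, the dict layout, and the idea of scaling by the number of reads in each fasta file are all hypothetical.

```python
def fractions_to_counts(fractions, n_reads):
    """Scale per-OTU abundance fractions to estimated read counts.

    fractions: {otu_id: fraction of the sample assigned to that OTU}
    n_reads:   number of sequences in the sample's fasta file (assumed known)
    """
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    # Renormalise in case the fractions do not sum exactly to 1.
    return {otu: round(n_reads * frac / total) for otu, frac in fractions.items()}


def build_otu_table(sample_fractions, sample_read_counts):
    """Build {sample_id: {otu_id: estimated_count}} for several samples."""
    return {
        sample_id: fractions_to_counts(fractions, sample_read_counts[sample_id])
        for sample_id, fractions in sample_fractions.items()
    }


# Hypothetical per-sample output: fractions of each database sequence.
sample_fractions = {
    "sampleA": {"otu_1": 0.5, "otu_2": 0.25, "otu_3": 0.25},
    "sampleB": {"otu_1": 0.0, "otu_2": 1.0, "otu_3": 0.0},
}
otu_table = build_otu_table(sample_fractions, {"sampleA": 1000, "sampleB": 200})
```

Because each count is rounded independently, the counts for a sample may not sum exactly to its read total; a real script would need to decide whether that matters.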

hope this helps!

ghost commented 11 years ago

I think Gail is right. A completely separate script that outputs a biom/otu_table instead of integrating it as a classifier might be a better approach. What do you think?

antgonza commented 11 years ago

One concern about this approach is the trade-off between speed and accuracy, especially when we think about small but significant biological differences. For example, in Fig 5 of your paper the large differences between the bacterial communities of human body sites are minimized to the point that urogenital and oral look pretty similar (maybe an effect of using 91% vs 97% similarity?). With this in mind, do you know what to expect when differences are not that big? Tests that come to mind: http://www.ncbi.nlm.nih.gov/pubmed/22832344 or http://www.ncbi.nlm.nih.gov/pubmed/19043404

Anyway, I was looking for some input/output examples in your repository, in order to give a more informed suggestion, but couldn't find any. Did I miss them?

gailrosen commented 11 years ago

Well -- I think that urogenital looks more like skin than oral (as the Koren et al paper shows). We didn't have room in the paper to include the PC2 vs. PC3 ... but here it is: http://tinypic.com/r/11hz49l/5. I think that view elucidates the urogenital relationship a little better (between skin/digestive/oral). Perhaps some of the closeness is related to using weighted UniFrac and thresholding some insignificant counts, and using unweighted UniFrac would help. We did not include examples for exploring different uses of Quikr and will explore your suggestion for doing so! Thank you.

We think that Quikr could be beneficial for users to quickly explore their data, but we do note some trade-offs.

gailrosen commented 11 years ago

Spoke with Jose Clemente yesterday, and he seems open to adding Quikr as an OTU table generator. Quikr processed all 281 Twin samples in under 90 CPU-minutes and the 10 Microbial Mat samples in 23 minutes (Greengenes 94%) on a desktop.

We have done the analysis that you have asked (GG is the greengenes database version, weighted/unweighted refers to the unifrac measure used):

Microbial Mats [figure: mats]
Caption: PC1 vs. PC2, GG94, unweighted. 29619.fasta is in yellow. Fasta file labels are here: http://www.ncbi.nlm.nih.gov/bioproject/29795 (ascending number means deeper depth)

Twins [figure: twins]
Caption: PC1 vs. PC2, GG91, unweighted. Obese: blue; lean: red; overweight: yellow.

More graphs here: http://eesilab.imgur.com

Timing of Quikr on a 3.2 GHz i5-3470 processor with 8GB RAM (note that user time is total CPU time, while real time was less because of multithreading):

Twin Study with GG91:

real 15m24.392s
user 54m53.738s
sys 2m55.191s

Twin Study with GG94:

real 21m7.982s
user 78m46.555s
sys 1m26.949s

Guerrero Times with GG91:

real 1m15.054s
user 4m3.639s
sys 0m2.220s

Guerrero Times with GG94:

real 7m22.540s
user 22m57.458s
sys 0m7.784s
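For readers who want to relate the real and user figures above, here is a small sketch that parses the timing strings and reports how many cores were busy on average, computed as (user + sys) / real. The helper names are hypothetical; the timing values are copied from the four runs listed above.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '15m24.392s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def average_cores_busy(real, user, sys_time):
    """CPU time (user + sys) divided by wall-clock time."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)


# (real, user, sys) for each run, as reported in this comment.
runs = {
    "Twin GG91": ("15m24.392s", "54m53.738s", "2m55.191s"),
    "Twin GG94": ("21m7.982s", "78m46.555s", "1m26.949s"),
    "Guerrero GG91": ("1m15.054s", "4m3.639s", "0m2.220s"),
    "Guerrero GG94": ("7m22.540s", "22m57.458s", "0m7.784s"),
}

for name, (real, user, sys_time) in runs.items():
    ratio = average_cores_busy(real, user, sys_time)
    print(f"{name}: {ratio:.2f} cores busy on average")
```

A ratio well above 1 is what one would expect from a multithreaded run, and is why the CPU-minute figures quoted earlier are several times larger than the wall-clock times.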

DEPENDENCY NOTES:

While Quikr-C is mostly written in GNU C, it does have a dependency on the MLton compiler, as a programmer wrote very fast k-mer counting code in Standard ML (the language MLton compiles) that we use to keep this fast. We can supply QIIME with an executable for the MLton code for most Linux and OSX flavors.

best, Gail

cleme commented 11 years ago

@gailrosen I believe what we discussed was whether Quikr could be a viable alternative to RDP, and as we discussed, there are three aspects to consider: (1) the time/accuracy trade-off, (2) dependencies, and (3) integration and support. Accuracy is critical regardless of time, so what I was suggesting is that you compare taxonomic assignments between Quikr and RDP at 97% for all taxonomic levels using compare_taxa_summaries.py.

For dependencies, we prefer to reduce rather than increase them. Providing executables should be fine as long as you are willing to commit to providing them in the future as well. Finally, you would also have to commit to integration and support for your tool within Qiime. @mutantturkey mentioned above he'd be ok with doing the integration himself, so the other missing piece would be support (e.g. questions in the Qiime forum, etc.).

ghost commented 11 years ago

Hi!

Time/Accuracy - We are running through taxonomic assignments by RDP and Quikr with Greengenes 97%. I'll report back as soon as I have results.

Dependencies - I hacked something together this week to remove any dependency on the MLton compiler. The Quikr binaries would be all we need, with a bit of work on our end to integrate the replacement into Quikr (this could potentially improve the performance of Quikr as well).

Integration - as you said, I'd be willing to write the integration, and support it - This is our baby! We want to love and nurture it and see it be successful. Quikr is pointless without any users, so we want to get our tool out there! Keeping the Quikr integration for Qiime up to date is essential for us, because we want people to be able to easily integrate our tool into their workflow.

gregcaporaso commented 11 years ago

Since Quikr doesn't fit in perfectly with the existing QIIME workflows (it seems to combine OTU picking and taxonomy assignment, which are distinct steps in QIIME), it might be better as a stand-alone step. If it takes QIIME's split libraries seqs files as input and generates a BIOM table as output, you could write a QIIME-like interface for it (we can provide some input on how to do that using pyqi), keep it stand-alone, and we could have you guys add a tutorial to the QIIME website on using Quikr with QIIME. I'd then see this as taking the place of one of the OTU picking workflows, and by outputting BIOM it would be easy for us to compare it to the existing OTU picking workflows.

Thoughts on that strategy? I'm thinking about ways to make it easy to use with QIIME, while fitting in with the current framework of how steps are broken up.

gailrosen commented 11 years ago

Hi,

You're right -- the 6mers are too low-resolution to get good results on the Twin study. So, we went to 8mers. However, we need more than 8GB of RAM (needed ~16GB) so we went to a slower server machine.
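To give a sense of why the step from 6mers to 8mers is expensive, here is a minimal k-mer counting sketch. It is an illustration only and not Quikr's counting code; it assumes a feature vector with one entry per possible DNA k-mer, which means 4^k entries, and the function name kmer_frequencies is hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every possible DNA k-mer.

    The result always has 4**k entries, one per k-mer over A, C, G, T.
    Windows containing other characters (e.g. N) are skipped.
    """
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


# Toy example with 2-mers on a short, made-up sequence.
freqs = kmer_frequencies("ACGTACGT", 2)

# Size of the feature vector for the two settings discussed here.
for k in (6, 8):
    print(f"k={k}: {4 ** k} possible k-mers")
```

Under that assumption, 6mers give 4,096 features and 8mers give 65,536, a 16-fold increase per sequence, which is at least consistent with the higher memory requirement reported above.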

Here are the results (all numbers are Pearson correlation coefficients, and L2 to L6 are the taxonomic levels compared):

Twin (281 samples), GG97, 8mers: 2093 cpu-minutes on a 2 GHz Intel E7-4820 processor
L6 0.7206, L5 0.8903, L4 0.795, L3 0.6501, L2 0.9942

Twin (281 samples), GG97, 6mers: 130 cpu-minutes on a 3.2 GHz Intel i5-3470 processor
L6 -0.0647, L5 -0.0428, L4 -0.0379, L3 0.1805, L2 0.8604

Guerrero Microbial Mat (10 samples), GG97, 6mers: 65 cpu-minutes on a 3.2 GHz Intel i5-3470 processor
L6 0.5528, L5 0.5798, L4 0.6237, L3 0.7573, L2 0.9998
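For reference, the kind of comparison behind these numbers can be sketched as a Pearson correlation between two taxa summaries at one taxonomic level. This is a plain-Python illustration and not the implementation of compare_taxa_summaries.py, which may align and aggregate taxa differently; the taxon names and abundances below are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def compare_summaries(summary_a, summary_b):
    """Correlate two {taxon: relative_abundance} dicts.

    Taxa missing from one summary are treated as abundance 0.
    """
    taxa = sorted(set(summary_a) | set(summary_b))
    return pearson(
        [summary_a.get(t, 0.0) for t in taxa],
        [summary_b.get(t, 0.0) for t in taxa],
    )


# Hypothetical phylum-level summaries from two classifiers.
method_1 = {"Firmicutes": 0.60, "Bacteroidetes": 0.30, "Proteobacteria": 0.10}
method_2 = {"Firmicutes": 0.55, "Bacteroidetes": 0.35, "Actinobacteria": 0.10}
r = compare_summaries(method_1, method_2)
```

A value near 1 means the two methods report similar relative abundances at that level; values near 0, as in the 6mer Twin results at L4 to L6, mean essentially no linear agreement.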

So... let us know what you think. I believe there will always be a time-accuracy trade-off... but Quikr seems sufficient for generating reasonable PCoA plots.

best, Gail

antgonza commented 11 years ago

Thanks for checking. I think reference-based OTU picking on the Twin study takes less time (about 2 hours on a single core), but I might be wrong; do you know? Note that the interesting result in this study is in alpha diversity (obese less diverse than lean) and not beta diversity.

Anyway, it sounds like the default in the script should be 8mers or more, right? If you take a quick look at your study and miss interesting patterns, it would be a huge shame. Do you have time benchmarks using 8mers or 10mers?

Sorry for all the emails and benchmarks but all the tools within QIIME have gone through a similar process, which in my opinion makes it a reliable and reproducible tool.