biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

Addition of three-model approach and elastic nets #2092

Closed alifar76 closed 8 years ago

alifar76 commented 9 years ago

Hello QIIME developers,

I'd like to suggest two additions to QIIME for statistical analysis, and to offer some code I have written for this purpose.

In a paper published last year, "The vaginal microbiota of pregnant women who subsequently have spontaneous preterm labor and delivery and those with a normal delivery at term", the authors took an interesting approach to identifying OTUs that differ between the two groups. Since OTU data are sparse count data, three statistical distributions can be used to model them: Poisson, negative binomial, and zero-inflated negative binomial. The paper fits these three regression models to each OTU separately and, for each OTU, selects the model with the lowest Akaike Information Criterion (AIC) value.
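The per-OTU selection step can be sketched in a few lines of Python. This is only an illustration of the AIC comparison, not the paper's code or my R script: the function names are made up for this example, and the log-likelihoods and parameter counts are invented numbers standing in for whatever the three fitted models would report.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_model(fits):
    """Pick the model with the lowest AIC.

    fits maps a model name to (log_likelihood, n_params).
    Returns (best_model_name, {model_name: aic_value}).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fit results for a single OTU (made-up values, for illustration).
example_fits = {
    "poisson": (-250.0, 2),
    "negative_binomial": (-180.0, 3),
    "zero_inflated_nb": (-179.5, 5),
}

best_model, aic_scores = select_model(example_fits)
print(best_model, aic_scores)
```

In this invented example the zero-inflated model has a slightly higher likelihood than the negative binomial, but not by enough to pay for its two extra parameters, so the negative binomial is selected.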

I have implemented this approach as an R script. In addition to the Akaike Information Criterion, my script applies the Bayesian Information Criterion (BIC), which penalizes extra parameters more heavily and is therefore more stringent. A downstream Python script then filters the results by a user-specified information criterion and a threshold on FDR-corrected p-values. The script is available here.
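As a rough sketch of what the filtering step involves (not the linked script itself), the snippet below computes BIC and applies a false discovery rate correction. It assumes the Benjamini-Hochberg procedure for the FDR correction, which is one common choice; the function names, OTU IDs, and p-values are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_otus(otu_ids, p_values, fdr_threshold=0.05):
    """Keep OTUs whose adjusted p-value is at or below the threshold."""
    q_values = benjamini_hochberg(p_values)
    return [otu for otu, q in zip(otu_ids, q_values) if q <= fdr_threshold]


# Hypothetical per-OTU p-values from the selected models.
otu_ids = ["OTU_1", "OTU_2", "OTU_3", "OTU_4"]
p_values = [0.01, 0.04, 0.03, 0.005]

q_values = benjamini_hochberg(p_values)
significant = filter_otus(otu_ids, p_values, fdr_threshold=0.03)
print(q_values, significant)
```

The BIC penalty per parameter is ln(n) rather than 2, so it exceeds the AIC penalty once there are 8 or more samples.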

The other addition I wish to propose is elastic nets. In a paper published by Dan Knights about 4 years ago, "Supervised classification of human microbiota", a number of ML methods were discussed. I have developed a simple R wrapper script around the elastic net method, primarily because this method offers feature selection, which is what biologists require from a given microbiome dataset. That is, given thousands of OTUs in an OTU table, it identifies which OTUs are informative for predicting an outcome such as a specific disease. My simple wrapper script is available here.
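To show why the elastic net performs feature selection, here is a minimal pure-Python sketch of coordinate descent for a continuous response. It is not my wrapper script, and it is simplified on purpose: it assumes centered features and response (so there is no intercept), uses the common parameterization with an overall penalty `lam` and a mixing weight `alpha`, and runs on a tiny made-up dataset. A disease outcome would normally call for a logistic version in an established package.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)
    by cyclic coordinate descent. X is a list of rows; columns and y are
    assumed to be centered already."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(
                    X[i][k] * coefs[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            denom = norm / n + lam * (1 - alpha)
            if denom == 0:
                coefs[j] = 0.0  # all-zero column with a pure L1 penalty
                continue
            coefs[j] = soft_threshold(rho / n, lam * alpha) / denom
    return coefs


# Made-up data: the response depends only on the first feature.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-4, -2, 0, 2, 4]

coefs = elastic_net(X, y, lam=0.5, alpha=0.5)
selected = [j for j, b in enumerate(coefs) if b != 0.0]
print(coefs, selected)
```

The L1 part of the penalty sets the coefficient of the uninformative second feature to exactly zero, which is the behavior that makes the method usable for picking OTUs out of a large table.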

These two scripts are used routinely in my lab and seem to be working really well. Since both scripts are in R, I think they can be added to the qiime/support_files/R folder. I'm not sure if you guys have coding guidelines for R scripts, but I'd be happy to refactor my code based on your input. Since my code currently runs via the Rscript command in a terminal, I think running it via the RExecutor application controller implemented in util.py should be fairly straightforward.
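For reference, calling an R script from Python through Rscript can be sketched with the standard library alone. This does not use QIIME's RExecutor, whose interface I am not reproducing here; the helper names, the script name, and the input file names below are placeholders.

```python
import subprocess


def build_rscript_command(script_path, script_args):
    """Build the argument list for running an R script non-interactively."""
    # --vanilla keeps saved workspaces and user profiles from
    # affecting the run.
    return ["Rscript", "--vanilla", script_path] + [str(a) for a in script_args]


def run_rscript(script_path, script_args):
    """Run the script and return its standard output.

    Raises subprocess.CalledProcessError if R exits with a non-zero status.
    """
    result = subprocess.run(
        build_rscript_command(script_path, script_args),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Placeholder script and input names, for illustration only.
command = build_rscript_command(
    "three_model.R", ["otu_table.txt", "mapping.txt", 0.05]
)
print(command)
```

Passing the command as a list avoids shell quoting problems with file paths, and `check=True` surfaces R failures as Python exceptions.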

Having said all this, I'd be more than happy to start working on the Python scripts that will integrate my R scripts into QIIME. Please let me know whether adding these features to QIIME would be useful and whether I can start collaborating with you on this front. Thank you very much.

Best, Ali

gregcaporaso commented 9 years ago

Hi @alifar76. Thanks for your interest in contributing to QIIME, and sorry for the slow response.

Later this week we'll be posting some information on the QIIME blog about our development efforts on QIIME 2, which will support a plugin system. It will likely make the most sense for you to develop these as plugins for QIIME 2. Over the next few months we'll be putting information together on how to do this, but one good way forward would be to package your tool for distribution as stand-alone software on conda. It could then take QIIME files as input (and QIIME 1 files will be supported in QIIME 2), but wouldn't be dependent on the QIIME release cycle for updates, etc. Does that sound like an approach that would work for you?

alifar76 commented 9 years ago

Hi @gregcaporaso,

Thanks a lot for writing back. Actually, what you've suggested sounds like a great idea. In fact, I have been working on a Python/R mashup pipeline that integrates the two ideas I proposed above. I'm currently adding more ML-based methods to the pipeline, such as Boruta and SVM, to ultimately look for OTU signals that may distinguish between treatment groups or disease states via the principle of stacking. I think my pipeline will (hopefully) have matured further by the time QIIME 2 is released, and it could then possibly be integrated as a plugin in QIIME 2.

So, thanks a lot for letting me know about it. I'll wait to hear more from you about the process of integrating stand-alone, conda-based software as a plugin for QIIME 2.

gregcaporaso commented 8 years ago

@alifar76, sounds great! The best way to get information about this going forward will be to keep an eye on the @QIIME_ Twitter account, as that's where we'll be announcing this information. More details to follow. Thanks again for your interest in contributing to QIIME!