16S reference database effects

biocore / emp

Code repository of the Earth Microbiome Project.

http://www.earthmicrobiome.org

BSD 3-Clause "New" or "Revised" License

16S reference database effects #46

Closed cuttlefishh closed 7 years ago

cuttlefishh commented 8 years ago

Choice of 16S database affects results to some degree. Main choices are RDP, SILVA, and Greengenes.

Which is more representative for environmental microbes, and which for host-associated microbes?
How do the results (downstream analyses) change with different databases?
SILVA has better representation -- the Greengenes team is working to update accordingly.

rob-knight commented 8 years ago

Note that we ran a lot of samples with old and new primers to compare results; this is in the mSystems paper led by Embriette.

cuttlefishh commented 8 years ago

Thanks! Link to paper: http://msystems.asm.org/mSystems.00009-15-abstract.php

colinbrislawn commented 8 years ago

Greengenes is developed internally, right? First by Todd Desantis, now by @wasade and others.

What is the ROI of maintaining a database, rather than using an off-the-shelf one? I wasn't around when Greengenes launched, so I don't know the original motivation for its creation, or how the field has changed since then.

Some folks really like the SILVA alignment, although that alignment may not matter as much for us.

wasade commented 8 years ago

SILVA does not construct a de novo phylogeny on each release and instead uses parsimony insertion via ARB to insert new sequences. The effect is that SILVA is not as well suited to characterizing candidate phyla.

The Greengenes Consortium includes Rob, Phil Hugenholtz, Todd, and me right now. We are very interested in expanding our development effort. The fundamental limitations right now are that we do not have centralized infrastructure in place, and developer support is thin. There is an open, in-progress RFC about the Greengenes infrastructure if you'd like to contribute, though.

colinbrislawn commented 8 years ago

So these things hold us back from adopting silva:

- Newest version not available for qiime, yet.
  - Small problem. Someone (who?) is fixing this.
- Poor support for candidate taxa.
  - Medium problem for OTU picking. May not be as bad as we think.
  - Small problem for taxonomy assignment (for de novo methods).
- Unknown / poorly described process for ARB parsimony insertion.
  - Big problem. We don't believe in magic.

Anything else? If we could address all these issues, would we be comfortable switching to silva? I'm not sure how the community feels about this... Comments welcome.

cuttlefishh commented 8 years ago

@ekopylova Would you be interested in comparing the amount of novel diversity identified by closed-reference OTU picking against Greengenes and SILVA? This would be per sample, based on the number of sequences mapped to Greengenes v.13.8 97% and SILVA v.123 97%. Basically we want to know two things:

Which database allows us to map more reads?
How much novel diversity per sample? (Then we'll group these by sample type, feed into niche modeling, etc.)
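
A minimal sketch of how the two questions above could be tabulated, assuming per-sample read counts are already available from a closed-reference run against each database. The function names, database labels, sample IDs, and counts below are hypothetical placeholders, not EMP code or EMP results. Reads that fail to hit the reference are used here only as a rough proxy for novel diversity, since they can also include low-quality or chimeric sequences.

```python
from typing import Dict, Mapping


def mapping_summary(
    total_reads: Mapping[str, int],
    mapped_reads: Mapping[str, int],
) -> Dict[str, Dict[str, float]]:
    """Return, per sample, the fraction of reads that hit the reference
    ("mapped") and the fraction that did not ("unmapped")."""
    summary: Dict[str, Dict[str, float]] = {}
    for sample, total in total_reads.items():
        if total <= 0:
            # Skip empty samples rather than divide by zero.
            continue
        # Samples missing from mapped_reads are treated as zero hits.
        fraction = mapped_reads.get(sample, 0) / total
        summary[sample] = {"mapped": fraction, "unmapped": 1.0 - fraction}
    return summary


def compare_databases(
    total_reads: Mapping[str, int],
    mapped_by_db: Mapping[str, Mapping[str, int]],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Apply mapping_summary to each database's mapped-read counts."""
    return {
        db: mapping_summary(total_reads, mapped)
        for db, mapped in mapped_by_db.items()
    }


# Made-up counts for illustration only (not real data).
example_totals = {"soil_1": 1000, "gut_1": 2000}
example_mapped = {
    "greengenes_13_8_97": {"soil_1": 600, "gut_1": 1800},
    "silva_123_97": {"soil_1": 700, "gut_1": 1700},
}
example_result = compare_databases(example_totals, example_mapped)
```

The per-sample "unmapped" fractions could then be grouped by sample type before feeding into anything downstream.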

ekopylova commented 8 years ago

When do you need to have this done by?

cuttlefishh commented 8 years ago

Within one week would be great, as some other things depend on this. Please write down what you did so we can easily insert it into the methods for the paper. Thanks!

ekopylova commented 8 years ago

Not sure I have the bandwidth this week. If the end of next week is possible, then I can.

ekopylova commented 8 years ago

I can look at this now. @cuttlefishh, is anyone else working on it?

rob-knight commented 8 years ago

This is important so if you can find bandwidth I'd appreciate it...thanks!
