RDP reference database -- does it come with a phylogenetic tree?

lfaller commented 7 years ago

Hi all,

I've been following the dada2 tutorial and used the RDP trainset 14 for taxonomy assignments.

I'd like to continue with a phyloseq analysis and also look at some phylogenetic trees. My plan was to use the RDP dataset from the tutorial, run muscle to calculate an MSA, and then run FastTree to generate a phylogenetic tree. However, the dataset is so large that muscle has been running for several days now.

Is anyone aware of a phylogenetic tree I can use for this analysis? Is there one that's ready-made?

Thanks for any advice!

benjjneb commented 7 years ago

I do not believe that RDP provides a phylogenetic tree linking all their sequences.

Can you clarify: Are you trying to build a tree on all the RDP sequences? Or on your sequences?

Building a tree yourself on all the RDP sequences would be a huge compute job.

lfaller commented 7 years ago

I would like to do a UniFrac analysis with the data that I processed with dada2. I figured I would need the phylogenetic tree of the reference dataset (so RDP v14).

Is this a good approach? If not, I'd appreciate other suggestions!

benjjneb commented 7 years ago

You don't need a tree of the reference sequences; you need a tree of the sequences in your dataset.

That should be much more tractable, as there are almost certainly a lot fewer sequences in your dataset than there are in the RDP reference database.
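For illustration only, here is a minimal sketch of building a tree from a dataset's own ASVs entirely within R. It swaps in DECIPHER for the alignment and phangorn for a neighbor-joining tree, which is not the muscle plus FastTree route mentioned earlier in this thread. The five toy sequences are made up so the snippet stands alone; with real data they would be the sequences from your own dada2 output.

```r
# Sketch: align ASV sequences and build a neighbor-joining tree.
# Assumes Biostrings and DECIPHER (Bioconductor) and phangorn (CRAN) are installed.
library(Biostrings)
library(DECIPHER)
library(phangorn)

# Hypothetical toy ASVs (40 nt each); replace with your own sequences.
seqs <- c(
  asv1 = "ACGTTGCAAGTCCGATAGGCTTACGATCGGATCCTAGCAT",
  asv2 = "ACGTAGCAAGTCCGATAGGCTTACGATCGGATCCTAGCAT",
  asv3 = "ACGTTGCAAGTCCGGTAGGCTTACCATCGGATCCTAGCAT",
  asv4 = "ACGTAGCAAGTCCGATAGGCTTACGATCGGATCCTGGCAT",
  asv5 = "ACGTTGCAAGTCCGGTAGGCTTACCATCGGATCCTAGTAT"
)

# Multiple sequence alignment of the ASVs only, not the reference database.
alignment <- AlignSeqs(DNAStringSet(seqs), anchor = NA, verbose = FALSE)

# Convert the alignment to phangorn's format and compute pairwise distances.
phang_align <- phyDat(as(alignment, "matrix"), type = "DNA")
dm <- dist.ml(phang_align)

# Neighbor-joining tree; the tip labels are the names of `seqs`.
tree <- NJ(dm)
print(tree)
```

The tip labels have to match the taxa names in the phyloseq object before the tree can be attached (for example with `phy_tree()`), and a neighbor-joining tree is unrooted, so it would likely need rooting before a UniFrac calculation.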

lfaller commented 7 years ago

Thanks for the feedback!

It looks like the RDP dataset I downloaded has 10,678 sequences.

I have two metagenomics datasets that were processed separately with dada2 and then merged into a single phyloseq object:

> data1
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 8398 taxa and 18 samples ]
sample_data() Sample Data:       [ 18 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 8398 taxa by 6 taxonomic ranks ]
> data2
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 13951 taxa and 17 samples ]
sample_data() Sample Data:       [ 17 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 13951 taxa by 6 taxonomic ranks ]
> merged = merge_phyloseq(data1, data2)
> merged
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 22349 taxa and 35 samples ]
sample_data() Sample Data:       [ 35 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 22349 taxa by 6 taxonomic ranks ]

As it turns out, I end up with a larger number of taxa in the final dataset than what is in the RDP dataset. I assume there are duplicated taxa in this final dataset and it will probably be more appropriate to reach out to the phyloseq folks about how to best reduce the redundancy in the final merged dataset.

However, it also seems that data2 has more taxa than there are sequences in the RDP file. Is this plausible?

And a more general question: the two datasets are similar in that they are both soil samples from a similar environment but they did come from two different continents and were sequenced by different facilities. In order to compare them, should I process them through dada2 separately and then combine them in phyloseq like I did here, or should I process them together? I kept them separate because I figured the error rates would probably be modeled more effectively that way, but I'd be happy to hear your thoughts on this.

Thanks!

benjjneb commented 7 years ago

Ah yes, soil is very diverse so you could end up with that many ASVs in soil data.

I'm not sure of the best way to get around the long alignment time, then. You could filter out sequences by prevalence or abundance if you don't care about rare variants. You could also try one of the NAST aligners, e.g. PyNAST, which I think is available through QIIME or as a stand-alone tool.
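As a rough sketch of the prevalence/abundance filtering idea, the base-R snippet below works on a made-up count matrix (samples in rows, ASVs in columns). The thresholds `min_prevalence` and `min_abundance` are hypothetical values chosen for the example, not recommendations.

```r
# Toy count matrix with made-up numbers: rows are samples, columns are ASVs.
counts <- matrix(
  c(10, 0, 3, 0,
     8, 4, 0, 1,
    12, 5, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample", 1:3), paste0("asv", 1:4))
)

min_prevalence <- 2  # hypothetical: ASV must appear in at least 2 samples
min_abundance  <- 5  # hypothetical: ASV must have at least 5 reads in total

prevalence <- colSums(counts > 0)  # number of samples in which each ASV occurs
abundance  <- colSums(counts)      # total reads per ASV

# Keep only ASVs that pass both thresholds.
keep <- prevalence >= min_prevalence & abundance >= min_abundance
filtered <- counts[, keep, drop = FALSE]

print(colnames(filtered))
```

On a phyloseq object the same idea can be expressed with `filter_taxa()`, for example `filter_taxa(ps, function(x) sum(x > 0) >= 2, prune = TRUE)`, where `ps` stands for your own object.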

On the data processing side: you should process them separately because they were sequenced at different facilities, so they might have quite different error profiles. Then you can merge them later, as you are doing. However, from what I see there were no overlapping ASVs between the two datasets (the ntaxa in the merged phyloseq object = ntaxa in data1 + ntaxa in data2: 8398 + 13951 = 22349).
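The overlap check above can be sketched in base R as follows. The identifiers are made-up stand-ins for ASV names; with real data they could come from `taxa_names(data1)` and `taxa_names(data2)`.

```r
# Hypothetical ASV identifiers for two datasets (made-up sequences).
data1_asvs <- c("ACGTACGT", "ACGTTCGT", "TTGACCGA")
data2_asvs <- c("GGCATTAC", "GGCATTAG")

# ASVs present in both datasets, and the number of distinct ASVs after merging.
shared   <- intersect(data1_asvs, data2_asvs)
n_merged <- length(union(data1_asvs, data2_asvs))

# With no shared ASVs, the merged count is simply the sum of the two counts.
no_overlap <- n_merged == length(data1_asvs) + length(data2_asvs)
print(length(shared))
print(no_overlap)

# The same arithmetic with the numbers reported in this thread.
print(8398 + 13951 == 22349)
```

Any ASV shared between the two datasets would make the merged count smaller than the sum, so an exact match suggests that no identifier occurs in both.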

Was the same primer set used in each case?

benjjneb / dada2

RDP reference database -- does it come with a phylogenetic tree? #362