bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement

OTUs missing in edge_data file #91

Closed cwatt closed 2 years ago

cwatt commented 2 years ago

I've noticed that PAPRICA sometimes drops OTUs from the edge_data file, most commonly in the Acidobacteria phylum. The OTUs are included in the placements file, but their subtree is designated as "phylum_reps" and they often (but not always) do not have map_ratio or map_id values. Not all Acidobacteria are excluded in this way, though, and I have no idea what might be causing this behavior. I attached an example file (the problematic OTUs are at the end).

I haven't noticed this in previous versions of PAPRICA that I've used, so I think it might be a new issue? Thanks for your help!

subset_rep-seqs.bacteria.combined_16S.bacteria.tax.placements.csv

bowmanjeffs commented 2 years ago

Thanks Cassandra for checking in on this. I think that in this case paprica is performing as expected. If reads don't place definitively to a phylum (or subphylum for some well-represented taxa) in the first round of phylogenetic placement, then they fall out with the phylum_reps designation and will not appear in the edge_data files, etc. A couple of questions: 1) how do you know the reads are Acidobacteria, and 2) would you mind sharing a selection of your Acidobacteria reads? If those are truly Acidobacteria, and they aren't placing to that phylum, then we need to adjust the Acidobacteria reference sequences.
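For anyone who wants to see which reads stayed in the first-round tree, here is a minimal sketch of how the placements table could be split on the subtree value. It is not paprica code. It assumes the placements CSV has columns named `subtree`, `map_ratio`, and `map_id` (the thread mentions these fields but does not show the header), and the `name` column and the three example rows are made up for illustration.

```python
import csv
import io


def split_placements(rows):
    """Separate reads left in the phylum_reps subtree from reads placed further."""
    unplaced, placed = [], []
    for row in rows:
        # Assumption: the subtree designation is stored in a column called "subtree".
        if row["subtree"] == "phylum_reps":
            unplaced.append(row)
        else:
            placed.append(row)
    return unplaced, placed


# Made-up table standing in for a *.placements.csv file (hypothetical values).
example = io.StringIO(
    "name,subtree,map_ratio,map_id\n"
    "otu_1,Acidobacteria,0.91,123\n"
    "otu_2,phylum_reps,,\n"
    "otu_3,phylum_reps,0.42,456\n"
)

unplaced, placed = split_placements(csv.DictReader(example))

# Reads that never left the first-round (phylum_reps) tree.
unplaced_names = [row["name"] for row in unplaced]

# Of those, the reads that also lack map_ratio or map_id.
missing_map = [
    row["name"] for row in unplaced if not row["map_ratio"] or not row["map_id"]
]

print(unplaced_names)
print(missing_map)
```

To run this on a real file, replace the `io.StringIO` object with `open(path, newline="")` and adjust the column names to match the actual header.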

bowmanjeffs commented 2 years ago

PS... you are correct in noting that this didn't happen with previous versions of paprica. The two-step phylogenetic placement process massively improves accuracy (via much improved reference trees) but comes with the caveat that some odd things don't find a home in the first round.

cwatt commented 2 years ago

Hi Jeff, I appreciate how responsive you are to issues! I've attached a file with the sequences and the taxonomies that my classifier chose (SILVA v138 pre-built classifier trained on scikit-learn 0.24.1). However, when I BLAST these, they end up matching a lot of uncultured clones, some of which are Acidobacteria, others not.

missed_OTUs.csv

bowmanjeffs commented 2 years ago

Thanks Cassandra, I'll take a look, but what you said is consistent with my interpretation. These are reads that are at best weakly associated with the Acidobacteria (or any other phylum) refs. In that case neither our tool nor any other is going to give you a very reliable metabolic inference. Note that those reads aren't dropped from the final ASV table though. In cases like this my group typically leaves the classification at the domain level (i.e. "bacteria") and focuses instead on the ASV distribution pattern for hypothesis testing. Whether that works for you of course depends on your analysis goals.

cwatt commented 2 years ago

That makes sense, and I agree that this is likely an expected result of hard-to-place taxa. Thank you!