biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

Possible to represent relative abundance of taxonomy rather than OTUs? #947

Closed wwood closed 3 months ago

wwood commented 5 months ago

Hi,

I'm considering adding an option to SingleM to allow BIOM format as an output.

The problem is that I'm not sure about what the canonical way to do this is. The --taxonomic-profile output of SingleM currently produces a sparse 3 column TSV like this:

sample  coverage    taxonomy
ERR1914274  3.16    Root; d__Bacteria
ERR1914274  0.06    Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria

So there are no "OTUs" as such (at least for this type of output) - each value is just the estimated genome coverage of a lineage, which does not include the coverage of its descendant lineages.
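For concreteness, here is a minimal sketch of how the two rows above might be packed into a BIOM 1.0-style JSON document with the taxonomy strings used as observation IDs. The field names follow my reading of the BIOM 1.0 JSON layout and should be checked against the spec; `to_biom_v1` and the `generated_by` value are hypothetical and not part of SingleM or biom-format.

```python
import json
from datetime import datetime, timezone

# Rows copied from the example above: (sample, coverage, taxonomy).
rows = [
    ("ERR1914274", 3.16, "Root; d__Bacteria"),
    ("ERR1914274", 0.06,
     "Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria"),
]


def to_biom_v1(rows):
    """Build a BIOM 1.0-style dict with taxonomy strings as observation IDs.

    Hypothetical helper: the key names are an assumption based on the
    BIOM 1.0 JSON layout, not output produced by any existing tool.
    """
    taxa = sorted({taxonomy for _, _, taxonomy in rows})
    samples = sorted({sample for sample, _, _ in rows})
    taxon_index = {taxonomy: i for i, taxonomy in enumerate(taxa)}
    sample_index = {sample: j for j, sample in enumerate(samples)}

    # Sparse triples: [observation index, sample index, value].
    data = [[taxon_index[t], sample_index[s], cov] for s, cov, t in rows]

    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "format_url": "http://biom-format.org",
        "type": "Taxon table",
        "generated_by": "SingleM (hypothetical exporter)",
        "date": datetime.now(timezone.utc).isoformat(),
        "rows": [
            {"id": t, "metadata": {"taxonomy": t.split("; ")}} for t in taxa
        ],
        "columns": [{"id": s, "metadata": None} for s in samples],
        "matrix_type": "sparse",
        "matrix_element_type": "float",
        "shape": [len(taxa), len(samples)],
        "data": data,
    }


table = to_biom_v1(rows)
print(json.dumps(table, indent=2))
```

This only shows the container; it does not settle whether the values should be exclusive or inclusive of descendant lineages.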

I considered using the taxa as the observation IDs, but that left me with 2 questions:

  1. Should the coverage of a taxon include the coverage of its descendants? i.e. in the above, should the entry for Root; d__Bacteria be 3.16 or 3.16+0.06=3.22?
  2. Relatedly, should the implied coverage of missing taxa be recorded? i.e. in the above, should there be an observation recorded for Root; d__Bacteria; p__Pseudomonadota?
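To make the two questions concrete, here is a small sketch of the "yes to both" variant: coverage is rolled up so each taxon includes its descendants, and intermediate taxa that were absent from the input get an entry. The function names are hypothetical, and whether "Root" itself should get a row is left open; this version includes it.

```python
from collections import defaultdict


def ancestor_lineages(taxonomy):
    """Yield each ancestor lineage of `taxonomy`, ending with itself."""
    ranks = taxonomy.split("; ")
    for depth in range(1, len(ranks) + 1):
        yield "; ".join(ranks[:depth])


def inclusive_coverage(exclusive):
    """Roll exclusive per-lineage coverage up to every ancestor lineage.

    Intermediate lineages missing from the input appear in the output
    with the summed coverage of their descendants.
    """
    totals = defaultdict(float)
    for taxonomy, coverage in exclusive.items():
        for lineage in ancestor_lineages(taxonomy):
            totals[lineage] += coverage
    return dict(totals)


# Exclusive coverages for sample ERR1914274, from the TSV above.
exclusive = {
    "Root; d__Bacteria": 3.16,
    "Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria": 0.06,
}

inclusive = inclusive_coverage(exclusive)
for lineage, coverage in sorted(inclusive.items()):
    print(f"{coverage:.2f}\t{lineage}")
```

With the example data, Root; d__Bacteria becomes 3.22 and the previously missing Root; d__Bacteria; p__Pseudomonadota gets 0.06. The exclusive values can be recovered from the inclusive ones by subtracting each taxon's children, so no information is lost either way.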

Representing abundances of taxa is a pretty common usage (e.g. kraken), but it is complicated by the hierarchical nature of the observations. Bonus points if the schema of the taxonomy can somehow be represented, i.e. is there some way to record that the taxonomies are derived from GTDB R214?

Thanks, ben

wasade commented 5 months ago

Hey @wwood! Been a bit, hope you're well :)

It's really up to you how to structure this, and what representation of the data will be most meaningful for users of SingleM. BIOM as a format doesn't care if the entries are hierarchical or not. If you want to encode the taxonomy, it could be done via group metadata -- just represent it as a Newick string. I'm not aware of packages actually using the group metadata though. And, in the case of QIIME 2, it ignores sample and observation metadata anyway, as in that framework those entities fall under the semantic types Metadata and FeatureData[Taxonomy] respectively.
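As a rough sketch of the Newick idea, the lineage strings from the SingleM output could be folded into a tree and serialized as below. This is plain Python with hypothetical helper names, not a biom-format API, and it assumes all lineages share a single root. Labels are single-quoted because some Newick parsers treat underscores in unquoted labels as spaces.

```python
def build_tree(lineages):
    """Fold "; "-separated lineage strings into a nested dict of ranks."""
    tree = {}
    for lineage in lineages:
        node = tree
        for rank in lineage.split("; "):
            node = node.setdefault(rank, {})
    return tree


def to_newick(name, children):
    """Serialize one node and its descendants with quoted labels."""
    label = "'" + name.replace("'", "''") + "'"
    if not children:
        return label
    inner = ",".join(
        to_newick(child, grandchildren)
        for child, grandchildren in sorted(children.items())
    )
    return f"({inner}){label}"


lineages = [
    "Root; d__Bacteria",
    "Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria",
]

tree = build_tree(lineages)
# Assumption: every lineage starts at the same root, so there is one top node.
(root_name, root_children), = tree.items()
newick = to_newick(root_name, root_children) + ";"
print(newick)
```

The resulting string is what would be stored as group metadata; how it gets attached to the table depends on the BIOM writer being used.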

wasade commented 3 months ago

Hi @wwood, I'm closing this issue as I'm unsure how to address the concerns. Please reopen if needed.

wwood commented 3 months ago

Hi @wasade, sorry for the lack of response here. Your reply makes total sense, though I might be lazy and wait for other tools to support it, so there's an established structure to work with.

Congratulations on gg2 btw, we are definitely making use of it, trying to bridge the amplicon-genome gap. A taxonomy update would be most welcome, if there is one coming?

wasade commented 3 months ago

Hi @wwood, no worries and thanks! A taxonomy update is in the works. It would be nice to sync up sometime, any chance you could ping me at damcdonald@ucsd.edu?