KoslickiLab / DiversityOptimization

Minimizing biological diversity to improve microbial taxonomic reconstruction
MIT License

Convert 16S output format to BIOM #7

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Goal: write a script/method that transforms the x vector returned by MinDivLP.py into a format used by biologists. For 16S data, this is the BIOM format.

This will require a few design decisions. First, the BIOM format spec is based around OTU counts, while the diversity optimization approach reconstructs relative abundances, not counts. Identifying OTUs should not be an issue, since in the currently used training database each individual sequence is considered an OTU (one sequence = one OTU). A decision will need to be made about how to convert from relative abundances to counts (we could do something arbitrary for now, such as multiplying by a fixed number, until we have a better way to do it).
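As a rough illustration of the "multiply by a fixed number" idea, here is a minimal sketch. The function name `abundances_to_counts` and the default `total_count=10000` are placeholders chosen for this example, not anything defined in the repository, and the largest-remainder rounding is just one reasonable way to make the counts add up exactly.

```python
from math import floor


def abundances_to_counts(x, total_count=10000):
    """Scale relative abundances to integer pseudo-counts.

    `x` is any sequence of non-negative floats (a list or a NumPy array).
    `total_count` is an arbitrary scaling constant (an assumption, not a
    value taken from the BIOM spec or from MinDivLP.py).
    """
    # Clip small negative values that a numerical solver might return.
    values = [max(float(v), 0.0) for v in x]
    total = sum(values)
    if total == 0:
        return [0] * len(values)

    # Normalize, scale, and take the integer part of each entry.
    scaled = [v / total * total_count for v in values]
    counts = [floor(v) for v in scaled]

    # Largest-remainder rounding: hand the leftover counts to the entries
    # with the biggest fractional parts so the result sums to total_count.
    shortfall = total_count - sum(counts)
    by_remainder = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - counts[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        counts[i] += 1
    return counts


print(abundances_to_counts([0.5, 0.25, 0.25], total_count=100))  # [50, 25, 25]
```

The resulting counts are pseudo-counts only: they preserve the proportions in `x`, but carry no information about sequencing depth.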

Secondly, the diversity optimization approach is mainly used to reconstruct taxonomic profiles, not OTU counts. There are conversion methods that convert from BIOM to a taxonomic profile.

As such, we might consider:

  1. Work on converting x to the BIOM format
  2. Skip the BIOM format, and use the taxonomic information associated with the training database to directly create a taxonomic profile in the same format as mentioned here.

Option 2 might be the most straightforward.
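To make option 2 a bit more concrete, here is a minimal sketch of aggregating `x` into a taxonomic profile. It assumes three things that are not specified in this thread: that `otu_ids` lists the training sequence IDs in the same order as the entries of `x`, that `taxonomy` maps each ID to a list of rank labels, and that summing abundances per lineage prefix is the desired aggregation. The function name and the `depth` parameter are hypothetical, and the output is a plain dictionary rather than any particular profile file format.

```python
from collections import defaultdict


def taxonomic_profile(x, otu_ids, taxonomy, depth=6):
    """Sum reconstructed abundances over lineages truncated to `depth` ranks.

    x        : sequence of relative abundances (one per training sequence)
    otu_ids  : sequence of IDs, assumed to be in the same order as x
    taxonomy : dict mapping ID -> list of rank labels, kingdom first
    depth    : number of leading ranks to keep (6 would be genus level
               for a seven-rank kingdom-to-species lineage)
    """
    profile = defaultdict(float)
    for otu_id, abundance in zip(otu_ids, x):
        if abundance <= 0:
            continue  # skip sequences the reconstruction left out
        lineage = taxonomy.get(otu_id)
        # IDs missing from the mapping are pooled under one label.
        key = ";".join(lineage[:depth]) if lineage else "Unassigned"
        profile[key] += float(abundance)
    return dict(profile)


# Toy example with made-up IDs and lineages.
toy_taxonomy = {
    "a": ["k__Bacteria", "p__Firmicutes"],
    "b": ["k__Bacteria", "p__Firmicutes"],
    "c": ["k__Bacteria", "p__Proteobacteria"],
}
print(taxonomic_profile([0.5, 0.3, 0.2], ["a", "b", "c"], toy_taxonomy, depth=2))
```

Because the aggregation works on relative abundances directly, this route avoids the ad-hoc abundance-to-count conversion entirely.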

dkoslicki commented 4 years ago

@cmcolbert I've updated this with more details about how to proceed on this. LMK if/when questions arise.

dkoslicki commented 4 years ago

And forgot to mention, here is the official GreenGenes data. See the README for what each file is. I've been using an older version for the training data, but that shouldn't matter much as we're just trying to get the components working first, and worry about the "correct" training databases and taxonomy later.

dkoslicki commented 4 years ago

@cmcolbert Can you also investigate option 2 above (skipping straight to the taxonomic profile)? Given that the conversion from relative abundances to OTU counts is a bit ad hoc, it would be good to have both in hand. I would suggest investigating the links above and checking the output format of the BIOM conversion to taxonomy (it should look like a text file similar to the one near the bottom of this page).

cmcolbert commented 4 years ago

@dkoslicki Absolutely; I am investigating it currently. Is there any preference between the two formats, having a mapping file or including taxonomic identifier as the name?

Also, for the associated taxonomic information, should I be using the tree file linked, or was it supposed to be the taxonomy text file?

dkoslicki commented 4 years ago

@cmcolbert

> Is there any preference between the two formats, having a mapping file or including taxonomic identifier as the name?

I assume having the taxonomic identifier (taxID) is preferable, but it shouldn't be too difficult to have both (and it just depends on what QIIME2 is going to spit out). So go with whichever sets you up well for the comparison to QIIME2 à la #9.

> Also, for the associated taxonomic information, should I be using the tree file linked, or was it supposed to be the taxonomy text file?

I assume it will be easier to use the taxonomy text file you linked, rather than trying to parse a tree.

And here again, be sure to set things up such that we can drop in a different taxonomy mapping file if we need to (e.g., if we find that QIIME2 is using gg_13_9 instead of gg_13_8 or something like that).
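One way to keep the taxonomy mapping swappable is to pass its path in as a parameter instead of hard-coding it. The sketch below assumes the mapping is a two-column, tab-separated text file with an ID in the first column and a semicolon-separated lineage in the second (the layout commonly seen in Greengenes taxonomy files); that layout should be checked against the actual file, and the function name `load_taxonomy` is hypothetical.

```python
def load_taxonomy(path):
    """Read an ID -> lineage mapping from a tab-separated taxonomy file.

    Assumed line layout (verify against the real file):
        <id><TAB>k__...; p__...; c__...; ...
    Returns a dict mapping each ID (as a string) to a list of rank labels.
    """
    taxonomy = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            otu_id, _, lineage = line.partition("\t")
            # Split on ';' and drop surrounding whitespace and empty pieces.
            taxonomy[otu_id] = [
                rank.strip() for rank in lineage.split(";") if rank.strip()
            ]
    return taxonomy
```

With this shape, switching database versions is just a matter of calling `load_taxonomy` with a different path; nothing downstream needs to know which release the file came from.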