caporaso-lab / mockrobiota

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
http://mockrobiota.caporasolab.us
BSD 3-Clause "New" or "Revised" License

Specify reference taxonomy files (e.g., %OTU ID) used for annotation of expected-taxonomy.tsv files #22

Closed nbokulich closed 7 years ago

nbokulich commented 8 years ago

Expected composition (expected-taxonomy.tsv) files must match not only the database and version, but also the exact reference taxonomy file used for taxonomy assignment of the observed data. In other words, if 97% OTUs are used for taxonomy assignment, a 97% OTU expected-taxonomy file must be generated (that's what we have now). If 99% OTUs are used, a 99% OTU expected-taxonomy file is needed, and so on.

Perhaps we should record this information somewhere. Any ideas how/where to do this? One option is to change the directory structure to database-name/version/OTU% or database-name/version-OTU%.

There are two issues with specifying this in the directory name: 1) the name can be ambiguous (e.g., "97" is not very specific), and 2) OTU %ID may not be the only difference between reference files (e.g., if using a curated subset of reference seqs), and it is specific to marker-gene ref dbs, i.e., it does not apply to metagenome ref dbs. We will need to be very descriptive in naming (e.g., "97-otus" instead of "97"), or perhaps add a README file to the directory? READMEs could get cumbersome.
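
To make the two candidate layouts concrete, here is a rough sketch of how the path to an expected-taxonomy.tsv file could be built from a descriptive label. The function name `expected_taxonomy_path`, its parameters, and the example values ("greengenes", "13-8", "97-otus") are illustrative assumptions only, not anything currently in the repository.

```python
from pathlib import PurePosixPath


def expected_taxonomy_path(database, version, ref_label, nested=True):
    """Build a relative path to an expected-taxonomy.tsv file.

    Hypothetical helper: `ref_label` is a descriptive name for the exact
    reference taxonomy file (e.g., "97-otus"), not just a bare number.
    """
    # Reject empty or purely numeric labels such as "97", which are ambiguous.
    if not ref_label or ref_label.isdigit():
        raise ValueError(
            f"ref_label {ref_label!r} is ambiguous; use e.g. '97-otus'"
        )
    if nested:
        # Layout 1: database-name/version/OTU%
        parts = (database, version, ref_label)
    else:
        # Layout 2: database-name/version-OTU%
        parts = (database, f"{version}-{ref_label}")
    return PurePosixPath(*parts, "expected-taxonomy.tsv")


# Example values are assumptions chosen for illustration.
nested_path = expected_taxonomy_path("greengenes", "13-8", "97-otus")
flat_path = expected_taxonomy_path("greengenes", "13-8", "97-otus", nested=False)
```

Because the label is free text, the same scheme would also cover cases where OTU %ID is not the distinguishing feature, such as a curated subset of reference seqs or a metagenome ref db.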