ENH: added @wmercurio's database identifiers for mock-6, mock-7, mock-8

@gregcaporaso These look good on a spot check of all species-level taxa. The only issue I see is whether we are doing the right thing for taxa that don't have species-level names, e.g., Bacteroides;Other. The accessions that William grabbed do match at that level, e.g., Bacteroides;Bacteroides_sp., but this Bacteroides sp. may not be the same one that is present in that mock community (there's a good chance it isn't, since it is clustering out in the 97% OTUs and we don't know which rep_seq it is clustered with).

Are we approaching this the right way? How are end users most likely to use these seqs? Would it be better to include the "100% OTU" seqs (i.e., from the full reference database) and keep these in the current location, /greengenes/13_8/database-identifiers.tsv, instead of eventually transferring them to a file-specific directory, e.g., /greengenes/13_8/97-otus/?

Or would it be better to just exclude those accession #s if species level is not assigned? (I'd vote no)

Another possibility: if we have OTU maps for the OTU picking used to create the 97% OTUs ref taxonomy, we could figure out which identifier matches the representative seq for the OTU into which species X's seqs were clustered. If end users will be using these ref seqs for comparisons to OTU-picked data, this may be the better approach. If not, using 100% OTUs may be best. Does that make sense?
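To make the OTU-map idea concrete, here is a minimal sketch in Python. It assumes a tab-separated OTU map in which each line is an OTU identifier followed by the IDs of the sequences clustered into that OTU, and it assumes the OTU identifier can stand in for the representative sequence's identifier. Both are assumptions to verify against the actual Greengenes files. The function names and the example IDs (`otuA`, `seq1`, ...) are hypothetical.

```python
import io


def load_otu_map(handle):
    """Return {member_seq_id: otu_id} from a tab-separated OTU map.

    Assumed line format (not confirmed for the real files):
    otu_id<TAB>member_1<TAB>member_2...
    """
    member_to_otu = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        otu_id, members = fields[0], fields[1:]
        for seq_id in members:
            member_to_otu[seq_id] = otu_id
    return member_to_otu


def to_otu_ids(accessions, member_to_otu):
    """Map each full-database accession to the OTU it was clustered into.

    Accessions that do not appear in the map are reported as None.
    """
    return {acc: member_to_otu.get(acc) for acc in accessions}


# Hypothetical two-OTU map, used only to illustrate the lookup.
example_map = io.StringIO("otuA\tseq1\tseq2\notuB\tseq3\n")
mapping = load_otu_map(example_map)
result = to_otu_ids(["seq2", "seq3", "seq9"], mapping)
```

Accessions that come back as `None` were not found in the map and would need to be checked by hand.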

To cover the most ground (at the expense of time), perhaps we should create both a 100% OTUs file and a 97% OTUs file.

On Thu, May 19, 2016 at 3:56 PM, Greg Caporaso notifications@github.com wrote:

@nbokulich https://github.com/nbokulich, would you mind passing through this to spot check a little bit (e.g., grab some identifiers from here, and confirm that they are associated with the right taxonomy)? We have automated checking of the file format and that the identifiers match those in expected-taxonomy.tsv, so no need to check that.

You can view, comment on, or merge this pull request online at:

https://github.com/caporaso-lab/mockrobiota/pull/23

Commit Summary

ENH: added @wmercurio's database identifiers for mock-6, mock-7, mock-8

File Changes

A data/mock-6/greengenes/13_8/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-0 (1)

A data/mock-6/silva/119/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-1 (1)

A data/mock-7/greengenes/13_8/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-2 (1)

A data/mock-7/silva/119/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-3 (1)

A data/mock-8/greengenes/13_8/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-4 (1)

A data/mock-8/silva/119/database-identifiers.tsv https://github.com/caporaso-lab/mockrobiota/pull/23/files#diff-5 (1)

