OpenTreeOfLife / reference-taxonomy

Open Tree Reference Taxonomy (OTT) tools
BSD 2-Clause "Simplified" License

Surface Genbank accession numbers for 'uncultured' tips #28

Open jar398 opened 10 years ago

jar398 commented 10 years ago

Email from Dail:

... we continue to believe that it is essential to link a GB# to these "uncultured" tips. The GB#s are already linked in SILVA so this should (hopefully) be straightforward.

JAR: Since SILVA already provides this functionality, it would seem economical to simply link out to SILVA, as the current interface does. The user can get all of the 16S accessions for a taxon with two clicks and all of the accessions with three clicks. How do you anticipate users and/or tools making use of this information?

jar398 commented 10 years ago

From Dail to opentreeoflife google group on 2014-03-26 (repeating here to keep all information on this issue in one place):

We can best capture the tremendous diversity among microbes by including the curated tips from SILVA (based on analyses of ssu-rDNA sequences). To increase the power for our users and to avoid adding 'identical names', we propose that any tip NOT marked by Genus species strain be marked as name GB#. For example, instead of many tips named "uncultured bacterium", "uncultured cyanobacterium", or "uncultured Genus", we would end up with "Uncultured bacterium GB#AB630682" and "Uncultured bacterium GB#FR667489" (which are in two different clades). We believe this should be straightforward, as GB numbers are already linked to tips in SILVA. The NCBI taxon id won't work, as all 'Uncultured bacterium' tips have the same taxon id.
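A minimal sketch of the labeling proposed above, assuming each tip is available as a (name, accession) pair. The function name `label_tips` and the rule "a name starting with 'uncultured' is a placeholder" are illustrative assumptions, not part of SILVA or the reference-taxonomy code.

```python
def label_tips(tips, marker="uncultured"):
    """Append 'GB#<accession>' to placeholder tip names.

    tips   -- list of (name, accession) pairs (assumed input shape)
    marker -- hypothetical prefix used to detect placeholder names
    """
    labels = []
    for name, accession in tips:
        if name.lower().startswith(marker):
            # Placeholder name: disambiguate it with the accession number.
            labels.append(f"{name} GB#{accession}")
        else:
            # Names that already identify a species or strain are kept as-is.
            labels.append(name)
    return labels


tips = [
    ("Uncultured bacterium", "AB630682"),
    ("Uncultured bacterium", "FR667489"),
]
labels = label_tips(tips)
```

With the two accessions from the example, the two tips that shared one name end up with distinct labels.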

jar398 commented 10 years ago

Some things I'm not clear on...

On average there are about 300 GenBank accessions and 6 SILVA tips (= subsequences from selected accessions) for each taxon in the reference taxonomy. Which GenBank accession number should be shown when there is more than one?

I think users, on seeing the accession number, will think it's the only one for the group, so we will need to figure out some way in the user interface to make it clear what's going on. The current user interface shows accession numbers chosen at random from within the group, and the way it's shown now needs to be changed to prevent confusion.

Or is the proposal to create a reference taxonomy tip for every GenBank accession or SILVA tip? Currently the SILVA tips are grouped together by their NCBI taxon ids (the NCBI taxa are not themselves in SILVA), following the strategy invented by Jessica, and each of those groups becomes a tip in the reference taxonomy.
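To make the grouping concrete, here is a rough sketch of collecting SILVA tips under their NCBI taxon ids. It is not Jessica's script; the function name, the input shape, and the taxon ids and accessions in the example are all hypothetical.

```python
from collections import defaultdict


def group_by_taxon(silva_tips):
    """Group SILVA tip accessions by NCBI taxon id.

    silva_tips -- iterable of (accession, ncbi_taxon_id) pairs (assumed shape)
    Returns a dict mapping each taxon id to the list of its accessions.
    """
    groups = defaultdict(list)
    for accession, taxon_id in silva_tips:
        groups[taxon_id].append(accession)
    return dict(groups)


# Hypothetical data: three SILVA tips, two of which share a taxon id.
example = [("ACC0001", 1001), ("ACC0002", 1001), ("ACC0003", 2002)]
groups = group_by_taxon(example)
```

Each key would become one reference taxonomy tip, which is why a tip can carry several accessions and a single displayed accession can mislead.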

By the way, maybe it doesn't matter, but I'm not sure what you mean by 'curated tip'. What I would call the 'tips' in SILVA are sequences that, as I understand it, are mechanically harvested from GenBank, and thus not curated. SILVA's taxonomic curation occurs at the next level up, which is usually at roughly the genus level. The level of NCBI taxa (species, strains, and 'uncultured' containers) that we use for reference taxonomy tips is in between these two levels and is not part of the SILVA tree; it is something Jessica's script interpolates.

(I acknowledge that the 'uncultured' pseudo-taxa are currently hidden; that's a separate issue, #27, that a few of us have been trying to work through.)

mtholder commented 10 years ago

Just chiming in to agree with @jar398's point that "we will need to figure out some way in the user interface to make it clear what's going on." It seems like we want some stub in the taxonomy to represent the OTU rather than just a sequence accession number. Until such a point as we recognize gene trees as distinct from the species tree, it seems like we want the taxonomy and the trees to have OTUs.

That doesn't mean that we can't transform Genus+species+GBAccessionNumber into some form like Genus+species+"exemplified by"+GBAccessionNumber, Genus+species+" including "+GBAccessionNumber, or Genus+species+"represented by"+GBAccessionNumber. But I do worry that it will be confusing to users if we appear to be presenting a gene tree for parts of the tree of life.

blackrim commented 10 years ago

I would say that I agree. Obviously I think this is only relevant to the microbial portion and possibly works because of de facto standards, but it would be good to be clear in the user interface what is going on.

jar398 commented 9 years ago

I favor the plan of treating SILVA clusters as taxa and adding them to OTT. The labels on the clusters should make this clear, e.g. SC-AB12345 for the SILVA cluster whose reference sequence is GenBank accession AB12345. The label could carry an NCBI taxon and/or strain name as well.
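A small sketch of how such a label might be built. Only the SC- prefix comes from the comment above; the function name and the way an optional NCBI taxon or strain name is appended in parentheses are assumptions for illustration.

```python
def cluster_label(reference_accession, ncbi_name=None):
    """Build a label for a SILVA cluster from its reference sequence accession.

    reference_accession -- GenBank accession of the cluster's reference sequence
    ncbi_name           -- optional NCBI taxon or strain name (appended format is assumed)
    """
    label = f"SC-{reference_accession}"
    if ncbi_name:
        # Hypothetical convention: put the NCBI name in parentheses after the label.
        label = f"{label} ({ncbi_name})"
    return label


plain = cluster_label("AB12345")
named = cluster_label("AB12345", "uncultured bacterium")
```

Keeping the accession in a fixed prefix position means the label stays unique even when many clusters share the same NCBI name.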

jar398 commented 9 years ago

I expect this will be handled as part of #123