There is no taxon 'Caecitellus parvulus strain HFCC301'

jar398 commented 10 years ago

Reported by: Dail

This one is not in NCBI Taxonomy. The original tip label is 'Caecitellus parvulus A0665 AY827848' and the Genbank accession record maps to the taxon Caecitellus parvulus. The 'description' in the SILVA browser is 'Caecitellus parvulus strain HFCC301 18S ribosomal RNA gene, partial sequence.' which is copied from the Definition field of the Genbank record.

No amount of harvesting of the SILVA dump is going to get this information; it would have to come on the fly from Genbank (and then be parsed out so that other Genbank records for the same strain can be matched to it?...).

jar398 commented 10 years ago

This is from study 2552 (http://reelab.net/phylografter/study/view/2552), but it is typical of a broad class of cases.

jar398 commented 10 years ago

The Genbank record contains /strain="HFCC301" -- maybe that's the solution to this problem! Look up the accession number to get the Genbank record, then append the strain name given by /strain= to the species name.
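A minimal sketch of that lookup, assuming the Genbank flat-file text for the accession has already been fetched (for example through NCBI's E-utilities `efetch`). The record fragment is hand-written for illustration, with a made-up location range, and the helper names `source_qualifiers` and `taxon_label` are hypothetical:

```python
import re

# Hand-written fragment in Genbank flat-file layout (illustrative only;
# the 1..1000 location is made up, the names come from this issue).
SAMPLE_RECORD = '''
FEATURES             Location/Qualifiers
     source          1..1000
                     /organism="Caecitellus parvulus"
                     /strain="HFCC301"
'''

# Matches qualifier lines of the form   /key="value"
QUALIFIER = re.compile(r'^\s+/(\w+)="([^"]*)"', re.MULTILINE)


def source_qualifiers(record_text):
    """Collect /key="value" qualifiers, keeping the first value seen per key."""
    qualifiers = {}
    for key, value in QUALIFIER.findall(record_text):
        qualifiers.setdefault(key, value)
    return qualifiers


def taxon_label(record_text):
    """Return 'Species strain X' when /strain= is present, else the species name."""
    qualifiers = source_qualifiers(record_text)
    organism = qualifiers.get("organism")
    if organism is None:
        return None  # nothing usable in this record
    strain = qualifiers.get("strain")
    return f"{organism} strain {strain}" if strain else organism


print(taxon_label(SAMPLE_RECORD))
```

This reads qualifiers from the whole record rather than from the source feature alone, which is a simplification; a real harvest would probably use a proper Genbank parser.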

kcranston commented 10 years ago

So, in cases where we have an accession number, we can search for /strain in the FEATURES section of the GenBank record. Do we know that this works for a good number of cases?

jar398 commented 10 years ago

Let's look at that study... just some spot checks:

Achlya apiculata A0989 AJ238656 --> no /strain= but it's the only sample for that species
Actinocyclus curvatulus A0276X85401 [note missing space] --> /strain="AWI 85"
Amphora coffeaeformis A0174 AY485498 --> /strain="CCAP1008/1"
Bacillariophyta sp. MBIC10099 A0622 AB183591 --> /strain="MBIC10099"
Bellerochea malleus A0205 AF525671 --> no /strain= but not expecting one (unique in species)
Blastocystis hominis A0686 AB070987 --> /strain="HJ96-1"
    ^ this one is important as there's major ambiguity without strain id.
Blastocystis lapemi A0677 AY590115 --> no
    that's "Blastocystis lapemi small subunit ribosomal RNA gene, partial sequence"
Blastocystis lapemi A0678 AY266471 --> no
    that's "Blastocystis lapemi 18S ribosomal RNA gene, partial sequence"
    So in this case, even though there's ambiguity at the species level, there are no strain ids. We're talking about sea snake parasites here, seems unlikely to me that they're cultured but who knows.
Blastocystis sp A0766 AY135408 --> no - and highly ambiguous
    "Blastocystis sp. clone 2 isolation-source rat feces 18S ribosomal RNA gene, partial sequence"
    Is 'clone 2 isolation-source rat feces' a strain name? I have no idea.
Chaetoceros gracilis A0224 AY229897 --> no but there is ambiguity (2 OTUs with this species name).
    The description is very long and does not suggest a strain name. There is /note="strain provided by the Microalgal Culture Library, Ocean University of China"

So it's a mixed bag. When there's a strain id, it looks reasonable. Where there isn't, then either it's the only sample for a species (no within-species strains perhaps), or the Genbank description doesn't suggest a strain name either. Probably many strains are unnamed - what do we do about these?... if there is only one Genbank deposit for the 'strain', we could identify it with the Genbank accession number, but what if that's not the case? (Not that I've found an example of that.)

hdliv commented 10 years ago

I would say another case study to look at would be this one:

http://www.reelab.net/phylografter/study/view/2739

jar398 commented 10 years ago

Excellent source. A lot of these have Genbank accession numbers, which I presume would lead to the same variety of dispositions as the other study, which I've already glossed. But many of them don't (why it's inconsistent, I don't know - I suspect it has to do with merging of OTU sets for different trees).

So a problem case is this one: 'Anabaena circinalis AWQC307C', which does not specify a Genbank accession id. The species has an NCBI taxonomy record, Dolichospermum circinale (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=109265&lvl=3&lin=f&keep=1&srchmode=1&unlock), id 109265, which has a large number of strains under it, but not that particular strain. If I do a nuccore search for AWQC307C[strain] I get two Genbank records with /strain="AWQC307C" and that taxon id. There would be no automatic way to determine which of these records had the sequence used in the study, but we don't really need that.

In any case, if we don't have a Genbank accession id, we have the problem of parsing the strain name out of the string that is the OTU label, and this is hard in general, requiring a set of custom regexps per study.

We would indeed want to create a new OTT id for this strain. That leaves the question of assigning provenance to the id. It could be one of the Genbank records, which the curator would have to locate and select. But I think the claim is that there are thousands of these and such a high level of curator effort per taxon does not scale? Alternatively the provenance could be the first study in which the strain is encountered, or else the nuccore search string that we used (that's somewhat fragile).

hdliv commented 10 years ago

I do not fully understand your comment about why it is not needed. I thought the goal was for the end product to be automatic, and I am trying to show and figure out where the glitches are.

Yes, there are thousands of these. Where do we go from here? The study is the way it is and that is not changing, so it has to be done on this side.

jar398 commented 10 years ago

I'm not sure I understand what you're saying, but where I used the word "need" I was saying that while it is easy in many cases to map automatically from the accession number to the strain name, it is not easy to map in the other direction, from strain name to a set of Genbank accession numbers for that strain. The latter comment was a distraction because it is really not at all relevant; we never have to do that, we only have to do the former, accession to strain. So pretend I never said it.

The process I currently have in mind is: from the set of original tip labels, consider two cases: (1) there is a Genbank accession number in the label, (2) there isn't. In case (1) we can by script parse out the accession number, look it up in Genbank, and get the taxon name and strain name (if there is one). If the strain isn't already in OTT we can add it. End of story. In case (2) we write rules specific to the study (since each study uses a different label syntax) that can parse out the taxon name and strain name found inside each tip label. Again, if the name isn't in OTT we can add it, assuming the rules did the right thing. This might require a quick manual scan at the end of the list of proposed strain names to make sure that junk doesn't get added to OTT.
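The two cases could be triaged roughly as follows. This is only a sketch under stated assumptions: the accession pattern covers just the classic nucleotide shapes (one letter plus five digits, or two letters plus six digits), `STUDY_RULE` is a hypothetical rule for a single study whose labels read "Genus species STRAIN", and `triage` is an illustrative name, not existing code:

```python
import re

# Assumption: only the classic nucleotide accession shapes are covered
# (1 letter + 5 digits, or 2 letters + 6 digits). The lookarounds keep
# codes such as "MBIC10099" or "A0665" from being mistaken for accessions.
ACCESSION = re.compile(r'(?<![A-Z])(?:[A-Z]{2}\d{6}|[A-Z]\d{5})(?!\d)')

# Hypothetical per-study rule: labels of the form "Genus species STRAIN".
STUDY_RULE = re.compile(r'^(?P<taxon>[A-Z][a-z]+ [a-z]+) (?P<strain>\S+)$')


def triage(label, study_rule=STUDY_RULE):
    """Sort a tip label into case (1), case (2), or manual review."""
    match = ACCESSION.search(label)
    if match:
        # Case (1): look this accession up in Genbank for taxon and strain.
        return ("accession", match.group(0))
    parsed = study_rule.match(label)
    if parsed:
        # Case (2): the study-specific rule proposes a taxon and strain name.
        return ("parsed", parsed.group("taxon"), parsed.group("strain"))
    # Neither applies: leave the label for a curator to look at.
    return ("manual", label)


for label in (
    "Caecitellus parvulus A0665 AY827848",
    "Actinocyclus curvatulus A0276X85401",
    "Anabaena circinalis AWQC307C",
):
    print(label, "->", triage(label))
```

Names proposed by the "parsed" branch would still need the quick manual scan mentioned above before anything is added to OTT.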

There remains the problem of strain name normalization or liberal matching, which is doable but a major software effort. It sounds like this may be necessary if we are to be able to process these studies properly.

hdliv commented 10 years ago

Yes, I feel the liberal matching is necessary to process these studies correctly, since we can't change the way the names are coming to us.

jar398 commented 10 years ago

If you could provide a few examples of the kind of liberal matching that would be needed, that would be very helpful.

hdliv commented 10 years ago

We have reported these in previous emails and calls. Can you look at the comments that we posted for the Skype meetings and back-and-forth calls? I have posted examples there...

jar398 commented 10 years ago

The 'liberal matching' problem is orthogonal to the missing strain problem and now has its own issue, #18.

jar398 commented 10 years ago

Just to bring you up to date: work is in progress on getting strain names out of Genbank when an accession number is available, specifically in SILVA and in those source trees that provide accession numbers. Finishing this is probably still a few weeks away. Handling strain names in OTUs when there is no accession number, or no strain name in Genbank, will probably be part of a future 'community taxonomy editing' system.

hdliv commented 10 years ago

thank you

kcranston commented 9 years ago

Many of these might be fixed by PR 131

OpenTreeOfLife / reference-taxonomy

There is no taxon 'Caecitellus parvulus strain HFCC301' #8