OpenTreeOfLife / opentree

Opentree browsing and curation web site. For overarching or cross-repo concerns, please see the 'germinator' repo.
http://tree.opentreeoflife.org/
BSD 2-Clause "Simplified" License

If genbank id is in original label, offer to map to appropriate taxon #575

Open jar398 opened 9 years ago

jar398 commented 9 years ago

E.g. if the label is 'Rana sp. nov. G12345', look up G12345 in our genbank table to get appropriate SILVA or NCBI taxon, and suggest that taxon (which we hope is not 'Rana sp. nov.').

jimallman commented 9 years ago

The most straightforward way to implement this is in the existing TNRS. Your description above suggests that the SILVA-matching behavior would

Correct? I'd propose as well that

I'll take a look at the TNRS code to see if I can sort this out...

jimallman commented 9 years ago

look up G12345 in our genbank table to get appropriate SILVA or NCBI taxon

@jar398, where is this table?

jimallman commented 9 years ago

Here are my notes from a little digging in the TNRS code (in case they're useful to someone else).

General approach: Should we add genbank ID matching to the existing contextQuery API method? Or run it in parallel as a separate service and attempt to combine results in the curation UI? So far, I'm assuming the former.

There are two versions of TNRS. I gather tnrs_v2.java is not being used, so I'm focused on TNRS.java instead.

I'm inclined to code the ID-matching behavior in a general way, so we can support other ID types.

Is ID detection+matching always enabled in TNRS, or should we add a flag to arguments?

Is this a use case for the idStrings argument (passing candidate IDs detected by client-side regex)? It seems this argument assumes 1:1 pairs of names to IDs, so probably not.

The sensible place to call ID detection+matching is probably here, maybe just before calling getExactNameMatches.

Should genbank (or other external ID) matching ignore the (explicit or implicit) search context?

Should SILVA or other matches reset the LICA?

jar398 commented 9 years ago

On Wed, Feb 11, 2015 at 2:06 PM, Jim Allman notifications@github.com wrote:

Here are my notes from a little digging in the TNRS code (in case they're useful to someone else).

General approach: Should we add genbank ID matching to the existing contextQuery API method? Or run it in parallel as a separate service and attempt to combine results in the curation UI? So far, I'm assuming the former.

We wouldn't have to put this table lookup in taxomachine at all. It's really a terribly simple operation - a file with two columns, and a lookup in a trivial table. It's a matter of what's easiest to implement, and what's easiest to deploy. (e.g. do we want to make a new TNRS database every time the Genbank table changes?) It doesn't really have anything to do with taxonomies so it's not clear that the TNRS is the right place. Also remember that we can't rely on help from Cody...
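To make the "file with two columns" idea concrete, here is a minimal sketch in Python of a standalone lookup that lives outside the TNRS. The tab-separated layout, the column order (accession first, taxon id second), the function names, and the sample rows are all assumptions for illustration; the thread does not show the real table.

```python
import csv
import io

def load_accession_table(handle):
    """Read a two-column, tab-separated table: accession, then taxon id."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        # Normalize case so 'g12345' and 'G12345' resolve to the same entry.
        table[row[0].strip().upper()] = row[1].strip()
    return table

def lookup_taxon(table, accession):
    """Return the taxon id for an accession, or None if it is unknown."""
    return table.get(accession.strip().upper())

# Made-up sample rows; the real file's contents are not shown in this thread.
SAMPLE = "G12345\t9999\nAB123456\t8888\n"
table = load_accession_table(io.StringIO(SAMPLE))
print(lookup_taxon(table, "g12345"))
```

Because the table is only a file read into a dictionary, it could be refreshed whenever the Genbank data changes without rebuilding a TNRS database.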

There are two versions of TNRS. I gather tnrs_v2.java https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/org/opentree/taxonomy/plugins/tnrs_v2.java is not being used, so I'm focused on TNRS.java https://github.com/OpenTreeOfLife/taxomachine/blob/b11c83b59ad94a138dceefb81e412d1754905155/src/main/java/org/opentree/taxonomy/plugins/TNRS.java instead.

I'm inclined to code the ID-matching behavior in a general way, so we can support other ID types.

  • add a regex for each type of ID (GenBank, NCBI taxa, what else?)
  • search for any matches for detected IDs; include all matched taxa in results and mark them accordingly (e.g. "exact match by GenBank ID")
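The two bullets above could be sketched roughly as follows, in Python for brevity even though TNRS itself is Java. The registry `ID_PATTERNS`, the helper names, and the result fields are hypothetical, and the single GenBank-style pattern (one letter plus five digits, or two letters plus six digits) is an assumption that covers only some accession formats.

```python
import re

# Hypothetical registry: one compiled pattern per ID type.
# Only a GenBank-style pattern is registered here (assumed shapes).
ID_PATTERNS = {
    "GenBank": re.compile(
        r"(?<![A-Za-z0-9])(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?![A-Za-z0-9])"
    ),
}

def detect_ids(label):
    """Return (id_type, id_string) pairs for every ID-like token in a label."""
    found = []
    for id_type, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(label):
            found.append((id_type, match.group(0)))
    return found

def match_ids(label, tables):
    """Look up each detected ID and record how the taxon was matched."""
    results = []
    for id_type, id_string in detect_ids(label):
        taxon = tables.get(id_type, {}).get(id_string)
        if taxon is not None:
            results.append(
                {"taxon": taxon, "match": f"exact match by {id_type} ID"}
            )
    return results

tables = {"GenBank": {"G12345": "9999"}}  # made-up lookup data
print(match_ids("Rana sp. nov. G12345", tables))
```

Supporting another ID type would mean adding one entry to the registry and one lookup table, with no change to the matching loop.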

I have never seen any other kind of id, apart from internal codes we'd have no chance of resolving. I have seen NCBI taxon ids in spreadsheets attached as supplementary material, but not in tip labels. Also the chance of a false hit for NCBI taxonomy is too great (they are just integers).

We might come across EMBL ids but I've never seen one and we don't have a way to look them up.

Is ID detection+matching always enabled in TNRS, or should we add a flag to arguments https://github.com/OpenTreeOfLife/taxomachine/blob/b11c83b59ad94a138dceefb81e412d1754905155/src/main/java/org/opentree/taxonomy/plugins/TNRS.java#L240 ?

I had thought the UI would have a regular expression that pulled out the genbank id, and it would pass the genbank id to a genbank id looker-upper, and the rest of the label to the TNRS. I.e. the TNRS wouldn't be responsible for the parsing.
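A rough sketch of that split, written in Python rather than the curation UI's JavaScript: one regular expression pulls the accession out of the label, and what is left is the text that would go to the TNRS. The pattern is an assumption covering only two accession shapes (one letter plus five digits, two letters plus six digits), and `split_label` is a hypothetical helper name.

```python
import re

# Assumed accession shapes; check NCBI's list of formats before relying on this.
# Lookarounds are used instead of \b so underscore-joined labels still match.
GENBANK_ID = re.compile(
    r"(?<![A-Za-z0-9])(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?![A-Za-z0-9])"
)

def split_label(label):
    """Return (accession or None, remaining label text)."""
    match = GENBANK_ID.search(label)
    if match is None:
        return None, label.strip()
    # Cut the accession out, then tidy leftover spaces and underscores.
    remainder = label[:match.start()] + " " + label[match.end():]
    return match.group(0), " ".join(remainder.split()).strip("_ ")

accession, name = split_label("Rana sp. nov. G12345")
# 'accession' would go to the id lookup, 'name' to the TNRS.
print(accession, "|", name)
```

With this division of labor the TNRS never sees the accession, so it needs no new arguments or parsing logic.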

There is a web page at NCBI listing all the possible Genbank id formats.

Is this a use case for the idStrings argument (passing candidate IDs detected by client-side regex)? It seems this argument assumes 1:1 pairs of names to IDs, so probably not.

The sensible place to call ID detection+matching is probably here https://github.com/OpenTreeOfLife/taxomachine/blob/2c6813fdeba0caa6edab7b7436f71dc3e3e91aa4/src/main/java/org/opentree/tnrs/queries/MultiNameContextQuery.java#L189, maybe just before calling getExactNameMatches.

Should genbank (or other external ID) matching ignore the (explicit or implicit) search context?

Should SILVA or other matches reset the LICA https://github.com/OpenTreeOfLife/taxomachine/blob/2c6813fdeba0caa6edab7b7436f71dc3e3e91aa4/src/main/java/org/opentree/tnrs/queries/MultiNameContextQuery.java#L447 ?

This feature is independent of SILVA so not sure what you're asking. And the LICA gets reset when labels match OTT ids, not before or after. I don't see why there is any interaction since the Genbank matches would happen during OTU mapping, and LICA computation would happen after OTU mapping, I would think.

kcranston commented 8 years ago

Issue #780 is related to Jim's comment about supporting other identifiers in the label.