TranslatorSRI / NameResolution

A service for finding CURIEs from lexical strings.
MIT License

don't use UMLS as primary identifier for taxonomy #71

Open balhoff opened 1 year ago

balhoff commented 1 year ago

I would rather receive an NCBI taxonomy identifier in most cases. However, there are many species that aren't in NCBI, so some other source might be needed for those (GBIF or Catalog of Life?). One problem example: searching for "american goldfinch", I get this result:

[
  {
    "curie": "UMLS:C0326959",
    "label": "Carduelis tristis",
    "synonyms": [
      "Spinus tristis",
      "Fringilla tristis",
      "Carduelis tristis",
      "American goldfinch",
      "Astragalinus tristis",
      "Carduelis tristis (organism)"
    ],
    "types": [
      "biolink:OrganismTaxon",
      "biolink:NamedThing",
      "biolink:Entity"
    ]
  }
]

However, the taxonomically valid name for this species is "Spinus tristis" (listed here only as a synonym), while the label "Carduelis tristis" is a taxonomic synonym. See https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=54773&lvl=3&lin=f&keep=1&srchmode=1&unlock and https://verifier.globalnames.org/?capitalize=on&format=html&names=Carduelis+tristis
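To make the request concrete, here is a minimal sketch of choosing a primary identifier by prefix preference. The priority list, the `GBIF` prefix spelling, and the function names are assumptions for illustration only; this is not how NameResolution or Babel is known to pick identifiers. `NCBITaxon:54773` is the taxon ID from the NCBI link above.

```python
# Hypothetical preference order; NOT taken from Babel's or NameResolution's config.
PREFIX_PRIORITY = ["NCBITaxon", "GBIF", "UMLS"]


def prefix_of(curie: str) -> str:
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def pick_primary(curies: list[str], priority: list[str] = PREFIX_PRIORITY) -> str:
    """Pick the CURIE whose prefix ranks highest in `priority`.

    Prefixes missing from `priority` rank last; ties keep input order.
    """
    if not curies:
        raise ValueError("expected at least one CURIE")

    def rank(curie: str) -> int:
        prefix = prefix_of(curie)
        return priority.index(prefix) if prefix in priority else len(priority)

    # min() returns the first item with the lowest rank, so input order breaks ties.
    return min(curies, key=rank)


# Identifiers assumed to be equivalent for the American goldfinch.
clique = ["UMLS:C0326959", "NCBITaxon:54773"]
print(pick_primary(clique))  # → NCBITaxon:54773
```

With an ordering like this, a UMLS identifier would only be returned when no identifier from a preferred source exists for the taxon.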

gaurav commented 1 year ago

This situation is slightly worse in Babel 2023jun29, where UMLS:C0326959 is entirely missing, because its semantic type, T012, is no longer mapped correctly into the Biolink model.

The NCBITaxon situation should be an easier fix: it looks like we're only importing "scientific name" and "synonym" (meaning taxonomic synonym, not alternate name) and ignoring "common name" and "genbank common name", which is where the common names live.

gaurav commented 1 year ago

The list of possible name_class values we can use, as of the May 1 release of NCBITaxon (I think), is:

     25     genbank acronym 
    230     blast name  
    667     in-part 
   2086     acronym 
  14641     common name 
  30328     genbank common name 
  56575     equivalent name 
  75081     includes    
 220185     type material   
 245827     synonym 
 670412     authority   
2503930     scientific name 
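For reference, a tally like the one above could be reproduced with a short script. This is only a sketch: it assumes the names come from the `names.dmp` file in the NCBI taxonomy dump, where fields are separated by tab, pipe, tab and each row ends with tab, pipe. The sample rows are made up for illustration and are not copied from the real dump.

```python
import io
from collections import Counter


def parse_names_line(line: str) -> list[str]:
    """Split one names.dmp row into [tax_id, name, unique_name, name_class]."""
    # Rows end with "\t|" and separate fields with "\t|\t".
    return line.rstrip("\n").removesuffix("\t|").split("\t|\t")


def count_name_classes(lines) -> Counter:
    """Count how often each name_class (fourth field) occurs."""
    return Counter(parse_names_line(line)[3] for line in lines if line.strip())


# Illustrative rows only; with the real dump, pass open("names.dmp") instead.
SAMPLE = (
    "54773\t|\tSpinus tristis\t|\t\t|\tscientific name\t|\n"
    "54773\t|\tCarduelis tristis\t|\t\t|\tsynonym\t|\n"
    "54773\t|\tAmerican goldfinch\t|\t\t|\tgenbank common name\t|\n"
)

counts = count_name_classes(io.StringIO(SAMPLE))
# Print in ascending order of count, like the listing above.
for name_class, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:7d}     {name_class}")
```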

So we definitely want to add common name and genbank common name so that searches by organism common name will work. We might also want to bring in equivalent name, and we should keep synonym so that taxonomic synonyms still resolve (e.g. Pinus abies is a synonym of the currently accepted name, Picea abies, so we would expect both to potentially bring back the same taxonomic name). I will need to double-check the remaining classes to make sure we don't need them. I am very surprised but pleased to see the 220,185 references to type material in here!
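A rough sketch of what that filter could look like, assuming the names are read from the NCBI taxonomy dump's `names.dmp` (fields separated by tab, pipe, tab). The set of classes, the function name, and the sample rows are illustrative assumptions, not Babel's actual import code, and "equivalent name" is included only tentatively.

```python
import io
from collections import defaultdict

# Assumed selection based on the discussion above; "equivalent name" is tentative.
WANTED_NAME_CLASSES = {
    "scientific name",
    "synonym",
    "common name",
    "genbank common name",
    "equivalent name",
}


def collect_names(lines, wanted=WANTED_NAME_CLASSES) -> dict[str, list[str]]:
    """Group the names whose name_class is in `wanted` by NCBITaxon CURIE."""
    names = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        # Each row: tax_id, name, unique name, name class, ending with "\t|".
        tax_id, name, _unique, name_class = (
            line.rstrip("\n").removesuffix("\t|").split("\t|\t")
        )
        if name_class in wanted:
            names[f"NCBITaxon:{tax_id}"].append(name)
    return dict(names)


# Illustrative rows only, not copied from the real dump.
SAMPLE = (
    "54773\t|\tSpinus tristis\t|\t\t|\tscientific name\t|\n"
    "54773\t|\tCarduelis tristis\t|\t\t|\tsynonym\t|\n"
    "54773\t|\tAmerican goldfinch\t|\t\t|\tgenbank common name\t|\n"
    "54773\t|\tSpinus tristis (Linnaeus, 1758)\t|\t\t|\tauthority\t|\n"
)

print(collect_names(io.StringIO(SAMPLE)))
```

Rows in other classes, such as the authority row in the sample, are dropped, so widening or narrowing the import is a one-line change to the set.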