TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

Pull in all of UMLS #119

Open cbizon opened 2 years ago

cbizon commented 2 years ago

Nodenormalizer has a few roles:

  1. Establishing equivalent ID sets
  2. Assigning types
  3. Returning labels

One problem is that people rely on it for 3, so if it can't do 1&2 for some chunk of data, then you don't get any labels.

Could we / should we pull in all of UMLS, either by 1) correctly merging everything in it into cliques/types, or 2) just bringing in the unmapped parts as NamedThings so that we can at least serve labels?
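A minimal sketch of what option 2 would mean for a label lookup, assuming made-up in-memory dictionaries. The names `CLIQUES`, `UMLS_LABELS` and `normalize`, and all identifiers in them, are hypothetical placeholders and not part of NodeNormalization:

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the normalizer's data; the real
# service does not necessarily store its cliques or labels this way.
CLIQUES = {
    "EX:1": {"id": "EX:1", "label": "example concept", "type": ["biolink:Disease"]},
}
UMLS_LABELS = {"UMLS:C0000000": "example unmapped concept"}  # placeholder CUI


def normalize(curie: str) -> Optional[dict]:
    """Return the clique for a CURIE, or a label-only NamedThing fallback."""
    if curie in CLIQUES:
        return CLIQUES[curie]
    label = UMLS_LABELS.get(curie)
    if label is None:
        return None  # unknown everywhere, so there is still no label to serve
    # Roles 1 and 2 are skipped: no equivalent IDs, only the most generic type.
    return {"id": curie, "label": label, "type": ["biolink:NamedThing"]}
```

With this fallback, an identifier that is not in any clique still gets a label, at the cost of carrying no equivalent identifiers and only the generic type.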

cbizon commented 2 years ago

We could make use of the semantic types and biolink mappings, but I would want to review those pretty carefully first.

cbizon commented 2 years ago

@colleenXu can you comment on any particular types of entities that give you the most trouble?

colleenXu commented 2 years ago

I didn't find as many examples as I expected but here are some:

SEMMEDDB has associations with outdated identifiers, so these show up as well...

EDIT: the semmeddb data files do have pipe-delimited identifiers, so basically mapping to entrez gene IDs...

colleenXu commented 2 years ago

More examples that SRI Node Normalizer doesn't fetch labels for. These map to biolink PhysiologicalProcess or MolecularActivity...

colleenXu commented 2 years ago

More examples of semmeddb semantic types already mentioned:

from other semmeddb semantic types:

colleenXu commented 2 years ago

Plant is particularly useful since it's used to annotate supplements in idisk...

cbizon commented 2 years ago

Many of the missing UMLS IDs are taxa (plants, birds, etc.). It looks like there are good mappings in the Metathesaurus to both MeSH and NCBI.

cbizon commented 2 years ago

Looking at some aapp types that don't currently map to anything else in nodenorm, it looks like there are at least some decent mappings to MeSH. We should review whether we want these mappings; I seem to recall them occasionally giving some trouble...

gaurav commented 2 years ago

Here's one way in which we could proceed:

  1. Once all the compendia are generated, we generate a final "UMLS.txt" compendium that consists of every UMLS ID from MRCONSO.RRF with a relevant semantic type (e.g. T092 "Organization" is probably not relevant to Translator, so we can just exclude all those IDs), minus any IDs that were already included in any other compendia. We can add the UMLS label and the overall semantic type mapped to a Biolink type to the compendium as well. (We could also include synonymy information from across the UMLS if that would be useful). This should meet the specific aims of this issue to improve UMLS coverage, but at the cost of having lots of concepts that aren't properly clustered to each other.
  2. We then start looking at the semantic types we have in UMLS.txt and determine whether more of those entries should be included in the specific compendia, such as anatomy, cellular component, and so on. The goal would be to reduce the number of entries in UMLS.txt by ensuring that its identifiers are properly clustered elsewhere in Babel. Eventually, we should end up with a UMLS.txt that only contains the identifiers that can't be easily placed anywhere else in the compendia.
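Step 1 can be sketched roughly as follows. This is an illustration on in-memory data, not Babel code: the function name `build_umls_compendium`, its parameters, the placeholder CUIs and the TUI-to-Biolink mapping are all assumptions, and reading MRCONSO.RRF itself is left out.

```python
def build_umls_compendium(umls_concepts, excluded_tuis, already_in_babel, tui_to_biolink):
    """Build leftover UMLS entries.

    umls_concepts: dict of CUI -> (label, set of semantic type IDs)
    excluded_tuis: semantic types judged irrelevant (e.g. T092 "Organization")
    already_in_babel: set of UMLS CURIEs present in other compendia
    tui_to_biolink: illustrative mapping of semantic type ID -> Biolink type
    """
    entries = []
    for cui, (label, tuis) in sorted(umls_concepts.items()):
        curie = f"UMLS:{cui}"
        if curie in already_in_babel:
            continue  # already clustered in another compendium
        relevant = set(tuis) - set(excluded_tuis)
        if not relevant:
            continue  # only has semantic types we chose to exclude
        types = sorted({tui_to_biolink[t] for t in relevant if t in tui_to_biolink})
        entries.append({"id": curie, "label": label, "types": types})
    return entries


# Placeholder CUIs and an illustrative (unreviewed) type mapping.
concepts = {
    "C0000001": ("some organization", {"T092"}),
    "C0000002": ("some drug", {"T121"}),
    "C0000003": ("already clustered concept", {"T121"}),
}
leftover = build_umls_compendium(
    concepts,
    excluded_tuis={"T092"},
    already_in_babel={"UMLS:C0000003"},
    tui_to_biolink={"T121": "biolink:Drug"},
)
```

Here `leftover` contains only the second concept; the first is dropped for its semantic type and the third because it already appears elsewhere. Step 2 then amounts to growing `already_in_babel` over time so that this list shrinks.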

@cbizon Do you think this would work?

cbizon commented 2 years ago

Yes, I think this is a very good plan.

colleenXu commented 2 years ago

Note that there are UMLS IDs that seem to map to more than one UMLS semantic type. For example, rosiglitazone is mapped to both Organic Chemical and Pharmacologic Substance.

gaurav commented 2 years ago

Sorry it took me a while to get back to this! It looks like there are 1,339,426 UMLS IDs in the latest Babel run. I found 1,037,476 UMLS IDs not already present in Babel that have a single Biolink type.

There were also 3,110,863 UMLS IDs without a Biolink type, which are:

      1  {'T007': {'Bacterium'}, 'T121': {'Pharmacologic Substance'}} -> []
      5  {'T007': {'Bacterium'}, 'T204': {'Eukaryote'}} -> []
     41  {'T021': {'Fully Formed Anatomical Structure'}} -> []
    160  {'T016': {'Human'}} -> []
    165  {'T010': {'Vertebrate'}} -> []
    516  {'T001': {'Organism'}} -> []
    565  {'T008': {'Animal'}} -> []
    891  {'T120': {'Chemical Viewed Functionally'}} -> []
   2458  {'T090': {'Occupation or Discipline'}} -> []
   6184  {'T091': {'Biomedical Occupation or Discipline'}} -> []
   7991  {'T031': {'Body Substance'}} -> []
  10514  {'T194': {'Archaeon'}} -> []
  18360  {'T167': {'Substance'}} -> []
  30162  {'T011': {'Amphibian'}} -> []
  34624  {'T014': {'Reptile'}} -> []
  41733  {'T015': {'Mammal'}} -> []
  69714  {'T005': {'Virus'}} -> []
  94786  {'T012': {'Bird'}} -> []
 114974  {'T013': {'Fish'}} -> []
 324645  {'T004': {'Fungus'}} -> []
 514514  {'T002': {'Plant'}} -> []
 550402  {'T007': {'Bacterium'}} -> []
1287458  {'T204': {'Eukaryote'}} -> []

I'll be trying to map those to Biolink types next.
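For reference, a tally like the one above can be produced along these lines. This is a sketch on made-up inputs, not the script that generated the report; `tally_unmapped` and its arguments are hypothetical names, and "without a Biolink type" is assumed to mean that none of an ID's semantic types has a mapping.

```python
from collections import Counter


def tally_unmapped(cui_to_tuis, tui_to_biolink):
    """Count CUIs per semantic-type combination when no type maps to Biolink."""
    counts = Counter()
    for tuis in cui_to_tuis.values():
        if not any(t in tui_to_biolink for t in tuis):
            counts[frozenset(tuis)] += 1
    return counts


# Placeholder CUIs; T002 and T007 are left unmapped on purpose.
example = {
    "C0000001": {"T002"},
    "C0000002": {"T002"},
    "C0000003": {"T007"},
    "C0000004": {"T121"},
}
counts = tally_unmapped(example, {"T121": "biolink:Drug"})

# Print in ascending order of count, like the report above.
for tuis, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:8d}  {sorted(tuis)} -> []")
```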

There are also some UMLS IDs that have multiple Biolink types:

     49 biolink:Drug|biolink:Food
    136 biolink:Activity|biolink:Procedure
    137 biolink:Device|biolink:Drug
    698 biolink:PhysicalEntity|biolink:Publication
   2156 biolink:Drug|biolink:SmallMolecule
   4556 biolink:Agent|biolink:PhysicalEntity

I suspect that I can remove the biolink:PhysicalEntity types without affecting the interpretation of the provided UMLS concepts. For the others (e.g. C1971594 "Chantix 1 MG Oral Tablet" is both a device and a drug, while C1999759 "BZL101" is both a drug and a small molecule), I think it makes sense to categorize these concepts as both of the specified Biolink types. Does that sound right?
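A minimal sketch of that rule, assuming types are handled as plain strings; `simplify_types` is a hypothetical helper, not an existing Babel function:

```python
def simplify_types(types, generic="biolink:PhysicalEntity"):
    """Drop the generic type when a more specific one is also present."""
    specific = [t for t in types if t != generic]
    # Keep the generic type only if it is the sole type available.
    return sorted(specific) if specific else sorted(types)


agent_only = simplify_types({"biolink:Agent", "biolink:PhysicalEntity"})
both_kept = simplify_types({"biolink:Drug", "biolink:SmallMolecule"})
```

Under this rule the Agent|PhysicalEntity and PhysicalEntity|Publication rows collapse to a single type, while pairs such as Drug|SmallMolecule keep both types.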

cbizon commented 2 years ago

This all looks good to me. I do have a few questions -

For all those ones that are different organism types (eukaryote, bird, etc), would it be feasible to go ahead and include them in the taxon concordance?

I don't really understand why C1971594 is a device?

Also, I would say that BZL101 is a small molecule but not a drug (in the Biolink sense of drug) unless we are doing molecule/drug conflation.

That said, the numbers on these are small, and I'm not concerned enough about them to hold this up.

gaurav commented 2 years ago

This might be doable by calling umls.write_umls_ids() and sending in the identifiers we need from the table of missing types above.

Not worth more than a day.

gaurav commented 2 years ago

After the fixes described above, we're now down to 36313 UMLS IDs without Biolink types and 0 UMLS IDs with multiple Biolink types. The UMLS IDs without types are:

     41 {'T021': {'Fully Formed Anatomical Structure'}} -> []
    160 {'T016': {'Human'}} -> []
    891 {'T120': {'Chemical Viewed Functionally'}} -> []
   2458 {'T090': {'Occupation or Discipline'}} -> []
   6184 {'T091': {'Biomedical Occupation or Discipline'}} -> []
   8219 {'T031': {'Body Substance'}} -> []
  18360 {'T167': {'Substance'}} -> []

The full list is available on Hatteras at /projects/babel/babel-outputs/2022sep6/reports/umls.txt.

I'll dig into this a bit more. At a quick look, it seems that T031 Body Substance might need to be added to anatomy (it includes things like peritoneal fluid, aqueous humor, bronchoalveolar lavage fluid), but T167 Substance doesn't (it includes things like elementary particles, fossils and hashish).