Open cbizon opened 2 years ago
We could make use of the semantic types and biolink mappings, but I would want to review those pretty carefully first.
@colleenXu can you comment on any particular types of entities that give you the most trouble?
I didn't find as many examples as I expected but here are some:
SEMMEDDB has associations with outdated identifiers, so these show up as well...
EDIT: the semmeddb data files do have pipe-delimited identifiers, so basically mapping to entrez gene IDs...
More examples that SRI Node Normalizer doesn't fetch labels for. These map to biolink PhysiologicalProcess or MolecularActivity...
More examples of semmeddb semantic types already mentioned:
from other semmeddb semantic types:
Plant is particularly useful since it's used to annotate supplements in idisk...
Many of the missing UMLS IDs are taxa (plants, birds, etc). It looks like there are good mappings in the Metathesaurus to both MeSH and NCBI.
Looking at some aapp types that don't map to other things in nodenorm at the moment, it looks like there are at least some decent mappings to MeSH. We should review whether we want these mappings; I seem to recall them occasionally giving some trouble...
Here's one way in which we could proceed:
@cbizon Do you think this would work?
Yes, I think this is a very good plan.
Note that there are UMLS IDs that seem to map to more than one UMLS semantic type. For example, rosiglitazone is mapped to both Organic Chemical and Pharmacologic Substance.
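A minimal sketch of how IDs with several semantic types could be found, assuming the standard pipe-delimited MRSTY.RRF layout (CUI|TUI|STN|STY|ATUI|CVF|). The function name, the placeholder CUIs and the sample rows are made up for illustration and are not Babel's actual code.

```python
from collections import defaultdict

def semantic_types_by_cui(lines):
    """Group MRSTY-style rows into {CUI: {TUI: semantic type name}}."""
    types = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        cui, tui, _stn, sty = fields[:4]
        types[cui][tui] = sty
    return dict(types)

# Made-up rows for illustration only (placeholder CUIs, empty STN/ATUI fields).
sample = [
    "C0000001|T109||Organic Chemical|||",
    "C0000001|T121||Pharmacologic Substance|||",
    "C0000002|T002||Plant|||",
]
by_cui = semantic_types_by_cui(sample)

# Keep only the CUIs that carry more than one semantic type.
multi = {cui: t for cui, t in by_cui.items() if len(t) > 1}
print(multi)
```

In a real run the lines would come from iterating over the MRSTY.RRF file rather than an in-memory list.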
Sorry it took me a while to get back to this! It looks like there are 1,339,426 UMLS IDs in the latest Babel run. I found 1,037,476 UMLS IDs not already present in Babel that have a single Biolink type.
There were also 3,110,863 UMLS IDs without a Biolink type, which are:
1 {'T007': {'Bacterium'}, 'T121': {'Pharmacologic Substance'}} -> []
5 {'T007': {'Bacterium'}, 'T204': {'Eukaryote'}} -> []
41 {'T021': {'Fully Formed Anatomical Structure'}} -> []
160 {'T016': {'Human'}} -> []
165 {'T010': {'Vertebrate'}} -> []
516 {'T001': {'Organism'}} -> []
565 {'T008': {'Animal'}} -> []
891 {'T120': {'Chemical Viewed Functionally'}} -> []
2458 {'T090': {'Occupation or Discipline'}} -> []
6184 {'T091': {'Biomedical Occupation or Discipline'}} -> []
7991 {'T031': {'Body Substance'}} -> []
10514 {'T194': {'Archaeon'}} -> []
18360 {'T167': {'Substance'}} -> []
30162 {'T011': {'Amphibian'}} -> []
34624 {'T014': {'Reptile'}} -> []
41733 {'T015': {'Mammal'}} -> []
69714 {'T005': {'Virus'}} -> []
94786 {'T012': {'Bird'}} -> []
114974 {'T013': {'Fish'}} -> []
324645 {'T004': {'Fungus'}} -> []
514514 {'T002': {'Plant'}} -> []
550402 {'T007': {'Bacterium'}} -> []
1287458 {'T204': {'Eukaryote'}} -> []
I'll be trying to map those to Biolink types next.
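For reference, a tally like the one above could be produced roughly as follows. This is only a sketch: the one-entry `TUI_TO_BIOLINK` mapping, the function name and the placeholder CUIs are assumptions, and the real mapping would need the careful review mentioned earlier in the thread.

```python
from collections import Counter

# Hypothetical, deliberately tiny TUI -> Biolink mapping (an assumption for this sketch).
TUI_TO_BIOLINK = {"T047": "biolink:Disease"}

def tally_unmapped(types_by_cui, tui_to_biolink=TUI_TO_BIOLINK):
    """Count CUIs with no Biolink type, keyed by their sorted semantic types."""
    counts = Counter()
    for tuis in types_by_cui.values():
        if not any(tui in tui_to_biolink for tui in tuis):
            counts[tuple(sorted(tuis.items()))] += 1
    return counts

# Placeholder CUIs, for illustration only.
example = {
    "C0000001": {"T002": "Plant"},
    "C0000002": {"T002": "Plant"},
    "C0000003": {"T007": "Bacterium", "T204": "Eukaryote"},
    "C0000004": {"T047": "Disease or Syndrome"},
}
unmapped = tally_unmapped(example)

# Print in ascending order of count, as in the report above.
for key, count in sorted(unmapped.items(), key=lambda kv: kv[1]):
    print(count, dict(key), "-> []")
```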
There are also some UMLS IDs that have multiple Biolink types:
49 biolink:Drug|biolink:Food
136 biolink:Activity|biolink:Procedure
137 biolink:Device|biolink:Drug
698 biolink:PhysicalEntity|biolink:Publication
2156 biolink:Drug|biolink:SmallMolecule
4556 biolink:Agent|biolink:PhysicalEntity
I suspect that I can remove those biolink:PhysicalEntity types without affecting the interpretation of the provided UMLS concepts. For the others (e.g. C1971594 "Chantix 1 MG Oral Tablet" is both a device and a drug, while C1999759 "BZL101" is both a drug and a small molecule) I think it makes sense to categorize these concepts as both of the specified Biolink types. Does that sound right?
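The pruning proposed here could look something like this. It is a sketch under the assumption that each concept's Biolink types are available as a collection of CURIE strings; `drop_physical_entity` is a hypothetical helper, not an existing Babel function.

```python
def drop_physical_entity(biolink_types):
    """Remove biolink:PhysicalEntity when a more specific type is also present."""
    types = set(biolink_types)
    if len(types) > 1:
        types.discard("biolink:PhysicalEntity")
    return sorted(types)

# PhysicalEntity is dropped when paired with another type...
agent_only = drop_physical_entity(["biolink:Agent", "biolink:PhysicalEntity"])
# ...kept when it is the only type...
physical_only = drop_physical_entity(["biolink:PhysicalEntity"])
# ...and other multi-type sets are left alone.
drug_and_molecule = drop_physical_entity(["biolink:SmallMolecule", "biolink:Drug"])
print(agent_only, physical_only, drug_and_molecule)
```

Leaving the other combinations untouched matches the suggestion of keeping both Biolink types for those concepts.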
This all looks good to me. I do have a few questions -
For all those ones that are different organism types (eukaryote, bird, etc), would it be feasible to go ahead and include them in the taxon concordance?
I don't really understand why C1971594 is a device?
Also, I would say that BZL101 is a small molecule but not a drug (in the biolink sense of drug) unless we are doing molecule/drug conflation.
That said, the numbers on these are small, and I'm not concerned enough about them to hold this up.
This might be doable by calling umls.write_umls_ids() and sending in the identifiers we need from this missing table above.
Not worth more than a day.
After the fixes described above, we're now down to 36,313 UMLS IDs without Biolink types and 0 UMLS IDs with multiple Biolink types. The UMLS IDs without types are:
41 {'T021': {'Fully Formed Anatomical Structure'}} -> []
160 {'T016': {'Human'}} -> []
891 {'T120': {'Chemical Viewed Functionally'}} -> []
2458 {'T090': {'Occupation or Discipline'}} -> []
6184 {'T091': {'Biomedical Occupation or Discipline'}} -> []
8219 {'T031': {'Body Substance'}} -> []
18360 {'T167': {'Substance'}} -> []
The full list is available on Hatteras at /projects/babel/babel-outputs/2022sep6/reports/umls.txt.
I'll dig into this a bit more. At a quick look, it seems that T031 Body Substance might need to be added to anatomy (it includes things like peritoneal fluid, aqueous humor, bronchoalveolar lavage fluid), but T167 Substance shouldn't be (it includes things like elementary particles, fossils and hashish).
Nodenormalizer has a few roles:
One problem is that people rely on it for 3, so if it can't do 1 and 2 for some chunk of data, then you don't get any labels.
Could we, or should we, pull in all of UMLS, either by 1) correctly merging everything in there into cliques/types, or 2) just bringing in the unmapped parts as NamedThings so that we can at least serve labels?
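To make option 2 concrete, here is a rough sketch of emitting label-only records typed as biolink:NamedThing for UMLS IDs that no existing clique covers. The record shape, the function name and the placeholder CUIs and labels are all assumptions for illustration, not the format Babel or Node Normalizer actually uses.

```python
def label_only_nodes(umls_labels, ids_in_cliques):
    """Build NamedThing records for UMLS IDs that no clique covers yet."""
    return [
        {"id": f"UMLS:{cui}", "label": label, "type": "biolink:NamedThing"}
        for cui, label in sorted(umls_labels.items())
        if f"UMLS:{cui}" not in ids_in_cliques
    ]

# Placeholder CUIs and labels, for illustration only.
labels = {"C0000001": "example plant", "C0000002": "example drug"}
nodes = label_only_nodes(labels, ids_in_cliques={"UMLS:C0000002"})
print(nodes)
```

The trade-off is that such records would serve labels without any equivalent identifiers or specific types, so they would not help with the other roles.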