biothings / semmeddb


Semantic Type Matching #6

Closed erikyao closed 1 year ago

erikyao commented 1 year ago

A CUI can have multiple semantic types. E.g.

| CUI | concept_name | semantic_type |
| --- | --- | --- |
| C3539881 | gamma-interferon | phsu |
| C3539881 | gamma-interferon | gngm |
| C3539881 | gamma-interferon | aapp |
| C3539881 | gamma-interferon | imft |

When a retired CUI is mapped to a multi-semantic-typed CUI, it's reasonable to match the semantic types for precise replacement. E.g.

| CUI | concept_name | semantic_type | :arrow_right: | CUI | concept_name | semantic_type | replacement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C0021740 | Recombinant Interferon-gamma | phsu | :arrow_right: | C3539881 | gamma-interferon | phsu | :o: |
| C0021740 | Recombinant Interferon-gamma | gngm | :arrow_right: | C3539881 | gamma-interferon | gngm | :o: |
| C0021740 | Recombinant Interferon-gamma | aapp | :arrow_right: | C3539881 | gamma-interferon | aapp | :o: |
| C0021740 | Recombinant Interferon-gamma | imft | :arrow_right: | C3539881 | gamma-interferon | imft | :o: |
| C0021740 | Recombinant Interferon-gamma | phsu | :arrow_right: | C3539881 | gamma-interferon | gngm | :x: |

However, Colleen also mentioned that some highly related semantic types should be considered as matched. E.g.

| CUI | concept_name | semantic_type | :arrow_right: | CUI | concept_name | semantic_type | replacement |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C1335188 | PAG gene | gngm | :arrow_right: | C1705981 | PAG1 wt Allele | gngm | :o: |
| C1335188 | PAG gene | aapp | :arrow_right: | C1705981 | PAG1 wt Allele | gngm | :question: |

The aapp :arrow_right: gngm match is also worth consideration.

Colleen and I came up with the following idea:

  1. If the new CUI has multiple semantic types, enable semantic type matching. E.g. (C0021740,aapp) :arrow_right: (C3539881,aapp).
  2. If the new CUI has only one semantic type, disable semantic type matching. E.g. (C1335188,aapp) :arrow_right: (C1705981,gngm).
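
As a minimal sketch of the two rules above (not the repository's actual implementation), the decision could be expressed as follows. The names `build_semtype_index`, `resolve_replacement`, and the `exact_only` flag are hypothetical; the sample rows are copied from the examples in this issue.

```python
from collections import defaultdict

# (CUI, semantic type) rows copied from the examples in this issue.
CUI_SEMTYPE_ROWS = [
    ("C3539881", "phsu"),
    ("C3539881", "gngm"),
    ("C3539881", "aapp"),
    ("C3539881", "imft"),
    ("C1705981", "gngm"),
]


def build_semtype_index(rows):
    """Map each CUI to the set of its semantic types."""
    index = defaultdict(set)
    for cui, semtype in rows:
        index[cui].add(semtype)
    return index


def resolve_replacement(old_semtype, new_cui, index, exact_only=False):
    """Return the (CUI, semtype) pair to substitute, or None to discard.

    Rule 1: a new CUI with multiple semantic types must match exactly.
    Rule 2: a new CUI with a single semantic type is accepted as-is,
            unless exact_only=True (hypothetical flag) forces exact matching.
    """
    new_semtypes = index.get(new_cui, set())
    if old_semtype in new_semtypes:
        return (new_cui, old_semtype)
    if not exact_only and len(new_semtypes) == 1:
        (only_semtype,) = new_semtypes
        return (new_cui, only_semtype)
    return None


SEMTYPES = build_semtype_index(CUI_SEMTYPE_ROWS)
print(resolve_replacement("aapp", "C3539881", SEMTYPES))  # ('C3539881', 'aapp')
print(resolve_replacement("aapp", "C1705981", SEMTYPES))  # ('C1705981', 'gngm')
print(resolve_replacement("aapp", "C1705981", SEMTYPES, exact_only=True))  # None
```

Setting `exact_only=True` corresponds to the "exact matching for all replacements" alternative, under which the aapp :arrow_right: gngm case is discarded.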

@newgene @andrewsu do you have any thoughts on the matching conditions? Should we carry out exact matching for all replacements, or explicitly whitelist the related semantic type pairs? I'd appreciate your input!

Note that a replacement can apply to either the subject or the object, so the matching conditions may affect the semantic meaning of the predications involved.

erikyao commented 1 year ago

Andrew's comment in the Dec 20 meeting:

Yao's TODO: report the number of removed predications.

erikyao commented 1 year ago

Total number of predications: 90,703,432. This number is obtained after:

  1. Predications with a novelty score of 0 are removed.
  2. Predications with invalid subject names are removed (just a few).
  3. All piped CUIs are separated into individual predications. E.g. C0056207|3075 is counted as two predications.
  4. Predications with deleted CUIs (according to UMLS) are removed.
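
The filtering steps above could be sketched roughly as below. This is an illustration under assumptions, not the parser's actual code: the field names (`subject_cui`, `object_cui`, `subject_novelty`, `object_novelty`) are hypothetical, a predication is assumed to be dropped when either novelty score is 0, and step 2 is omitted because the validity criterion for subject names is not stated in this thread.

```python
def split_piped_cuis(predication):
    """Expand piped subject/object IDs into one predication per combination."""
    subjects = predication["subject_cui"].split("|")
    objects = predication["object_cui"].split("|")
    return [
        {**predication, "subject_cui": s, "object_cui": o}
        for s in subjects
        for o in objects
    ]


def preprocess(predications, deleted_cuis):
    """Apply steps 1, 3 and 4 from the list above, in that order."""
    kept = []
    for predication in predications:
        # Step 1 (assumption: either side having novelty 0 removes the record).
        if predication["subject_novelty"] == 0 or predication["object_novelty"] == 0:
            continue
        # Step 3: separate piped IDs such as "C0056207|3075".
        for single in split_piped_cuis(predication):
            # Step 4: drop predications that reference a deleted CUI.
            if single["subject_cui"] in deleted_cuis or single["object_cui"] in deleted_cuis:
                continue
            kept.append(single)
    return kept
```

For example, a record whose subject is `C0056207|3075` and whose object is a single CUI yields two records after `split_piped_cuis`, which is how the total above counts it.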

N.B. Node normalizer lookups have not been performed yet.

Now consider predications with replaced CUIs:

andrewsu commented 1 year ago

For posterity (and only if it's easy to generate), can you provide some summary of the 669,924 predications that would be discarded? For example, are most of those because aapp got changed to gngm or vice versa?

But regardless, I'm comfortable moving forward with the exact semtype matching...

erikyao commented 1 year ago

Top 10 discarded subject semtypes

| SUBJECT_SEMTYPE | SUBJECT_SEMTYPE_NAME | count |
| --- | --- | --- |
| hcro | Health Care Related Organization | 116904 |
| fndg | Finding | 27473 |
| dsyn | Disease or Syndrome | 24603 |
| tisu | Tissue | 23088 |
| genf | Genetic Function | 21068 |
| lbpr | Laboratory Procedure | 18223 |
| patf | Pathologic Function | 17526 |
| gngm | Gene or Genome | 16133 |
| bpoc | Body Part, Organ, or Organ Component | 10744 |
| ortf | Organ or Tissue Function | 10507 |

Top 10 discarded object semtypes

| OBJECT_SEMTYPE | OBJECT_SEMTYPE_NAME | count |
| --- | --- | --- |
| patf | Pathologic Function | 37985 |
| genf | Genetic Function | 36366 |
| dsyn | Disease or Syndrome | 32800 |
| ortf | Organ or Tissue Function | 26308 |
| fndg | Finding | 26290 |
| sosy | Sign or Symptom | 17063 |
| gngm | Gene or Genome | 12314 |
| aapp | Amino Acid, Peptide, or Protein | 12219 |
| anab | Anatomical Abnormality | 11565 |
| lbpr | Laboratory Procedure | 7657 |

Top 10 discarded predicates

| predicate | count |
| --- | --- |
| LOCATION_OF | 221965 |
| PROCESS_OF | 83473 |
| AFFECTS | 61902 |
| COEXISTS_WITH | 37521 |
| TREATS | 33710 |
| USES | 28081 |
| CAUSES | 25475 |
| PART_OF | 23851 |
| AUGMENTS | 20763 |
| ASSOCIATED_WITH | 17609 |
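
Tallies like the ones in this comment could be produced with a simple frequency count over the discarded predications. The sketch below is only an illustration: the helper `top_counts` and the record field names are hypothetical, and the sample records are made up to show the call pattern, not taken from the data.

```python
from collections import Counter


def top_counts(discarded, key, n=10):
    """Return the n most frequent values of key(predication), with counts."""
    return Counter(key(p) for p in discarded).most_common(n)


# Made-up records, only to demonstrate usage.
sample = [
    {"subject_semtype": "hcro", "predicate": "LOCATION_OF", "object_semtype": "resa"},
    {"subject_semtype": "hcro", "predicate": "LOCATION_OF", "object_semtype": "resa"},
    {"subject_semtype": "fndg", "predicate": "PROCESS_OF", "object_semtype": "humn"},
]

by_subject = top_counts(sample, key=lambda p: p["subject_semtype"])
by_predicate = top_counts(sample, key=lambda p: p["predicate"])
by_triple = top_counts(
    sample,
    key=lambda p: (p["subject_semtype"], p["predicate"], p["object_semtype"]),
    n=20,
)
```

Passing a tuple-valued `key` gives the per-triple breakdown, while a single field gives the per-semtype or per-predicate breakdown.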

Top 20 discarded predications (as in triples of (subject_semtype, predicate, object_semtype))

| SUBJECT_SEMTYPE | SUBJECT_SEMTYPE_NAME | PREDICATE | OBJECT_SEMTYPE | OBJECT_SEMTYPE_NAME | count |
| --- | --- | --- | --- | --- | --- |
| hcro | Health Care Related Organization | LOCATION_OF | resa | Research Activity | 53532 |
| hcro | Health Care Related Organization | LOCATION_OF | lbpr | Laboratory Procedure | 31750 |
| hcro | Health Care Related Organization | LOCATION_OF | diap | Diagnostic Procedure | 23061 |
| fndg | Finding | PROCESS_OF | humn | Human | 20781 |
| dsyn | Disease or Syndrome | PROCESS_OF | humn | Human | 11115 |
| bpoc | Body Part, Organ, or Organ Component | LOCATION_OF | fndg | Finding | 9506 |
| tisu | Tissue | LOCATION_OF | aapp | Amino Acid, Peptide, or Protein | 9332 |
| gngm | Gene or Genome | LOCATION_OF | genf | Genetic Function | 9061 |
| genf | Genetic Function | PROCESS_OF | gngm | Gene or Genome | 8734 |
| mobd | Mental or Behavioral Dysfunction | PROCESS_OF | humn | Human | 7135 |
| bpoc | Body Part, Organ, or Organ Component | LOCATION_OF | patf | Pathologic Function | 6732 |
| bpoc | Body Part, Organ, or Organ Component | LOCATION_OF | anab | Anatomical Abnormality | 6377 |
| lbpr | Laboratory Procedure | USES | lbpr | Laboratory Procedure | 6187 |
| aapp | Amino Acid, Peptide, or Protein | AUGMENTS | ortf | Organ or Tissue Function | 6134 |
| lbpr | Laboratory Procedure | USES | aapp | Amino Acid, Peptide, or Protein | 5792 |
| bpoc | Body Part, Organ, or Organ Component | LOCATION_OF | aapp | Amino Acid, Peptide, or Protein | 5454 |
| blor | Body Location or Region | LOCATION_OF | fndg | Finding | 5253 |
| topp | Therapeutic or Preventive Procedure | TREATS | dsyn | Disease or Syndrome | 4388 |
| dsyn | Disease or Syndrome | PROCESS_OF | mamm | Mammal | 4315 |
| bpoc | Body Part, Organ, or Organ Component | LOCATION_OF | dsyn | Disease or Syndrome | 4149 |

erikyao commented 1 year ago

All hcro replacements (w/o semtype matching)

| CUI1 | concept_name1 | semantic_type_abbreviation1 | CUI2 | concept_name2 | semantic_type_abbreviation2 |
| --- | --- | --- | --- | --- | --- |
| C1552516 | Specialty Group | hcro | C0220961 | UMLS Metathesaurus | inpr |
| C1516172 | Cancer Center | hcro | C1513817 | NCI-Designated Cancer Center | hcro |
| C0237680 | Residential Care Institutions | hcro | C0035186 | Residential Facilities | hcro |
| C0237680 | Residential Care Institutions | hcro | C0035186 | Residential Facilities | mnob |
| C1609437 | Primary care clinic | hcro | C1552443 | Clinic / Center - Primary Care | mnob |
| C1609437 | Primary care clinic | hcro | C1552443 | Clinic / Center - Primary Care | hcro |
| C0872261 | repository | hcro | C3847505 | Repository | mnob |
| C1306377 | Postoperative anesthesia care unit | hcro | C0034871 | Recovery Room | hcro |
| C1306377 | Postoperative anesthesia care unit | hcro | C0034871 | Recovery Room | mnob |
| C1552447 | radiology facility | hcro | C1610162 | Radiology Clinic/Center | mnob |
| C1552447 | radiology facility | hcro | C1610162 | Radiology Clinic/Center | hcro |
| C0013967 | Emergency Service, Hospital | hcro | C0562508 | Accident and Emergency department | hcro |
| C0013967 | Emergency Service, Hospital | hcro | C0562508 | Accident and Emergency department | mnob |
| C0338036 | Doctor's office | hcro | C0031834 | Physicians' Offices | hcro |
| C0338036 | Doctor's office | hcro | C0031834 | Physicians' Offices | mnob |
| C1546895 | GlaxoSmithKline | hcro | C1552903 | SmithKline Beecham | hcro |
| C1619637 | Hospital Psychiatric Units | hcro | C0870667 | Psychiatric hospital unit | hcro |
| C1619637 | Hospital Psychiatric Units | hcro | C0870667 | Psychiatric hospital unit | mnob |
| C1546882 | NABI | hcro | C1552896 | NABI | hcro |
| C1546858 | Abbott Laboratories | hcro | C1552881 | Abbott Laboratories | hcro |
| C1546873 | Merieux | hcro | C1552891 | Merieux | hcro |
| C1512798 | Institute for Cancer Prevention | hcro | C1140168 | NCI Thesaurus | inpr |
| C1546884 | Novartis Pharmaceutical Corporation | hcro | C1552897 | Novartis Pharmaceutical Corporation | hcro |
| C4699045 | Stroke Center | hcro | C1136323 | Logical Observation Identifiers Names and Codes | inpr |
| C3845566 | Rehabilitation facility | hcro | C0034993 | Rehabilitation Centers | hcro |
| C3845566 | Rehabilitation facility | hcro | C0034993 | Rehabilitation Centers | mnob |

andrewsu commented 1 year ago

Great, this all looks fine to move forward. Please proceed with the update!

erikyao commented 1 year ago

Thank you for the confirmation!