biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

add in NER stats from SemMedDB to the semmeddb2 API #606

Closed andrewsu closed 1 year ago

andrewsu commented 1 year ago

Now that we've created the new https://biothings.ncats.io/semmeddb2 API as part of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/569 to investigate filtering strategies to improve signal/noise, let's also join in information about the Named Entity Recognition (NER) from the PREDICATION_AUX table (https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html):

image

I can really only imagine us using the SUBJECT_TEXT and SUBJECT_SCORE values (plus the corresponding OBJECT values), so let's focus on those. We can add these values to the predication object at the same level as the predication_id:

image
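A minimal sketch of the join described above, assuming the PREDICATION_AUX rows are already available as dictionaries keyed by predication ID. The helper name `add_ner_fields`, the `aux_by_id` lookup, and the exact column spellings (`SUBJECT_SCORE`, `OBJECT_SCORE`) are assumptions for illustration, not the actual parser code; the sample values are borrowed from Example#1 later in this thread.

```python
from typing import Any, Optional

# Assumed mapping from PREDICATION_AUX column names to document field names.
AUX_FIELDS = {
    "SUBJECT_TEXT": "subject_text",
    "SUBJECT_SCORE": "subject_score",
    "OBJECT_TEXT": "object_text",
    "OBJECT_SCORE": "object_score",
}


def add_ner_fields(
    predication: dict[str, Any], aux_row: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Return a copy of one predication entry with the NER fields added.

    The new fields sit at the same level as predication_id. If there is no
    matching PREDICATION_AUX row, the entry is returned unchanged.
    """
    merged = dict(predication)
    if aux_row is None:
        return merged
    for column, field in AUX_FIELDS.items():
        if column in aux_row:
            merged[field] = aux_row[column]
    return merged


# Hypothetical lookup of PREDICATION_AUX rows keyed by predication ID.
aux_by_id = {
    84692642: {
        "SUBJECT_TEXT": "plasmid",
        "SUBJECT_SCORE": 773,
        "OBJECT_TEXT": "dihydrofolate reductase",
        "OBJECT_SCORE": 1000,
    }
}

entry = {"predication_id": 84692642, "pmid": 360038}
merged_entry = add_ner_fields(entry, aux_by_id.get(entry["predication_id"]))
```

Returning a copy rather than mutating the entry keeps the sketch safe to run over documents that are reused elsewhere; a real loader might update in place instead.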

colleenXu commented 1 year ago

FYI: I also saw some possible predicate info and sentence-predication confidence info:

I don't see any "score" for the relationship selection ("predicate"), but there's also a SCORE column in the ENTITY table (that seems to relate to individual sentences).

There are columns in the PREDICATION_AUX table to identify the position of the string that was used to pick the predicate (PREDICATE_START_INDEX and PREDICATE_END_INDEX).

originally posted here

erikyao commented 1 year ago

Example#1: Plasmids STIMULATES Dihydrofolate Reductase

```python
{
    "_id": "C0032136-STIMULATES-C0039667",
    "predicate": "STIMULATES",
    "predication": [
        {
            "predication_id": 84692642,
            "pmid": 360038,
            "sentence_id": 43528009,
            "sentence": "The R factor induced enzyme was partially purified from a strain carrying a multicopy recombinant plasmid into which the 1770 bp fragment was inserted and which induced high levels of dihydrofolate reductase.",
            "subject_text": "plasmid",
            "subject_score": 773,
            "object_text": "dihydrofolate reductase",
            "object_score": 1000
        },
        {
            "predication_id": 102407545,
            "pmid": 6407900,
            "sentence_id": 74212980,
            "sentence": "A plasmid mutation has been identified that increases expression of mouse DHFR more than ten-fold.",
            "subject_text": "plasmid",
            "subject_score": 888,
            "object_text": "DHFR",
            "object_score": 824
        }
    ],
    "pmid_count": 2,
    "predication_count": 2,
    "subject": {
        "umls": "C0032136",
        "name": "Plasmids",
        "semantic_type_abbreviation": "bacs",
        "semantic_type_name": "Biologically Active Substance",
        "novelty": 1
    },
    "object": {
        "umls": "C0039667",
        "name": "Dihydrofolate Reductase",
        "semantic_type_abbreviation": "gngm",
        "semantic_type_name": "Gene or Genome",
        "novelty": 1
    }
}
```

Example#2: CDK3 gene INTERACTS_WITH activating transcription factor 1

```python
{
    "_id": "C1332734-INTERACTS_WITH-C0214635",
    "predicate": "INTERACTS_WITH",
    "predication": [
        {
            "predication_id": 125412171,
            "pmid": 18794154,
            "sentence_id": 120609124,
            "sentence": "Cyclin-dependent kinase 3-mediated activating transcription factor 1 phosphorylation enhances cell transformation.",
            "subject_text": "Cyclin-dependent kinase 3",
            "subject_score": 849,
            "object_text": "activating transcription factor 1",
            "object_score": 849
        },
        {
            "predication_id": 125412566,
            "pmid": 18794154,
            "sentence_id": 120609128,
            "sentence": "Furthermore, we found that cdk3 phosphorylates activating transcription factor 1 (ATF1) at serine 63 and enhances the transactivation and transcriptional activities of ATF1.",
            "subject_text": "cdk3",
            "subject_score": 1000,
            "object_text": "activating transcription factor 1",
            "object_score": 1000
        }
    ],
    "pmid_count": 1,
    "predication_count": 2,
    "subject": {
        "umls": "C1332734",
        "name": "CDK3 gene",
        "semantic_type_abbreviation": [
            "aapp",
            "gngm"
        ],
        "semantic_type_name": [
            "Amino Acid, Peptide, or Protein",
            "Gene or Genome"
        ],
        "novelty": 1
    },
    "object": {
        "umls": "C0214635",
        "name": "activating transcription factor 1",
        "semantic_type_abbreviation": "aapp",
        "semantic_type_name": "Amino Acid, Peptide, or Protein",
        "novelty": 1
    }
}
```

Example#3: C1333570-CAUSES-C0023882

Questionable NER data:

  1. The text "little" should not be connected to the concept Little's Disease.
  2. The text "PSMs" (plant secondary metabolites) should not be connected to the concept FOLH1 gene.
    • The FOLH1 gene encodes PSMA (Prostate-Specific Membrane Antigen), and the similarity between "PSMs" and "PSMA" could be the cause of the mistake.

```python
{
    "_id": "C1333570-CAUSES-C0023882",
    "predicate": "CAUSES",
    "predication": [
        {
            "predication_id": 182378913,
            "pmid": 31580494,
            "sentence_id": 342370606,
            "sentence": "Ambient temperature has been shown to alter liver function in rodents and the toxicity of some PSMs, but little is known about the physiological and nutritional consequences of consuming PSMs at different ambient temperatures.",
            "subject_text": "PSMs",
            "subject_score": 827,
            "object_text": "little",
            "object_score": 1000
        }
    ],
    "pmid_count": 1,
    "predication_count": 1,
    "subject": {
        "umls": "C1333570",
        "name": "FOLH1 gene",
        "semantic_type_abbreviation": "gngm",
        "semantic_type_name": "Gene or Genome",
        "novelty": 1
    },
    "object": {
        "umls": "C0023882",
        "name": "Little's Disease",
        "semantic_type_abbreviation": "dsyn",
        "semantic_type_name": "Disease or Syndrome",
        "novelty": 1
    }
}
```

erikyao commented 1 year ago

Statistics of all NER scores

| STAT | subject_score | object_score |
| --- | --- | --- |
| TOTAL | 122611719 | 122611719 |
| MIN | 0 | 0 |
| MAX | 1000 | 1000 |
| MEAN | 927.23 | 922.86 |
| MEDIAN | 916 | 901 |
| 2.5TH PERCENTILE | 766 | 759 |
| 25TH PERCENTILE | 888 | 872 |
| 50TH PERCENTILE | 916 | 901 |
| 75TH PERCENTILE | 1000 | 1000 |
| 97.5TH PERCENTILE | 1000 | 1000 |

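So NER shows high confidence in the connection between the entity texts and concepts: even the 2.5th percentile is above 750 for both scores. A threshold around 800 seems weak as a filter, since it sits below the 25th percentile of both scores and would remove fewer than a quarter of the values.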
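To make the threshold discussion concrete, here is a small sketch that applies a minimum NER score to predication entries. The function name `passes_ner_threshold` is hypothetical, and treating a missing score as 0 is an assumption; the three sample entries reuse the scores from Example#1 and Example#3 above.

```python
def passes_ner_threshold(predication, min_score):
    """True if both the subject and object NER scores reach min_score.

    Assumption: an entry with a missing score is treated as scoring 0.
    """
    return (
        predication.get("subject_score", 0) >= min_score
        and predication.get("object_score", 0) >= min_score
    )


# Scores copied from the examples in this thread.
predications = [
    {"predication_id": 84692642, "subject_score": 773, "object_score": 1000},
    {"predication_id": 102407545, "subject_score": 888, "object_score": 824},
    # The questionable "PSMs" / "little" record from Example#3.
    {"predication_id": 182378913, "subject_score": 827, "object_score": 1000},
]

kept_800 = [p["predication_id"] for p in predications if passes_ner_threshold(p, 800)]
kept_900 = [p["predication_id"] for p in predications if passes_ner_threshold(p, 900)]
```

With these three entries, a cutoff of 800 still keeps the questionable Example#3 record (scores 827 and 1000), while a cutoff of 900 drops all three, including the plausible Example#1 records.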

erikyao commented 1 year ago

Statistics of predication list lengths (i.e. predication_count values in existing documents)

| STAT | predication_count |
| --- | --- |
| TOTAL | 24481939 |
| MIN | 1 |
| MAX | 64451 |
| MEAN | 3.65 |
| MEDIAN | 1 |
| 2.5TH PERCENTILE | 1 |
| 25TH PERCENTILE | 1 |
| 50TH PERCENTILE | 1 |
| 75TH PERCENTILE | 2 |
| 97.5TH PERCENTILE | 15 |

The document with the max predication_count is exactly C0023884-PART_OF-C0034693 (Liver PART_OF Rattus norvegicus), the one that caused the BSONObjectTooLarge error in MongoDB.

If we apply a threshold of 1000 to the length of predication lists, 4,293 documents out of 24,481,939 (i.e. 0.0175%) will be affected.
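A rough sketch of what such a cap could look like, together with the arithmetic behind the percentage. The helper `cap_predications` is hypothetical; keeping the first entries and leaving `predication_count` at its original value (so truncation stays detectable) are assumptions, not decisions made in this thread.

```python
MAX_PREDICATIONS = 1000  # threshold proposed in the comment above


def cap_predications(doc, limit=MAX_PREDICATIONS):
    """Return a copy of doc whose predication list holds at most `limit` entries.

    Assumptions: the first `limit` entries are kept, and predication_count
    is left untouched so that truncated documents can still be identified.
    """
    capped = dict(doc)
    capped["predication"] = doc["predication"][:limit]
    return capped


# Share of documents affected, using the counts reported above.
affected, total = 4293, 24481939
share_percent = affected / total * 100  # roughly 0.0175

# Hypothetical oversized document with placeholder predication IDs.
big_doc = {
    "_id": "EXAMPLE-DOC",
    "predication": [{"predication_id": i} for i in range(1500)],
    "predication_count": 1500,
}
capped_doc = cap_predications(big_doc)
```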

erikyao commented 1 year ago

14 predication records, involving 10 semmeddb2 documents, have NO NER stats in the source data. They are:

| predication_id | doc _id |
| --- | --- |
| 201544459 | C0871124-PROCESS_OF-C0008059 |
| 203800327 | C0039185-STIMULATES-C2936529 |
| 201986825 | C1709820-PART_OF-C0029045 |
| 192912334 | N/A |
| 196519528 | N/A |
| 196519532 | N/A |
| 198555342 | C4321237-PROCESS_OF-C0008059 |
| 199170923 | C0429845-USES-C1709820 |
| 192883183 | C0020538-COEXISTS_WITH-C0871124 |
| 201986826 | C0018207-LOCATION_OF-6098 |
| 203209176 | C1422771-compared_with-C1418270 |
| 203209174 | C0206131-LOCATION_OF-C1418270 |
| 205429882 | N/A |
| 201469895 | C1817666-PROCESS_OF-C0024432 |

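
Records like these could be located with a check along the following lines. This is only a sketch: the helper `predications_missing_ner` is hypothetical, and it assumes a record without NER stats simply lacks the four fields rather than carrying null values. The sample document pairs one predication ID from the list above with a made-up placeholder entry.

```python
NER_FIELDS = ("subject_text", "subject_score", "object_text", "object_score")


def predications_missing_ner(doc):
    """Return the predication_ids in doc that lack at least one NER field."""
    return [
        p["predication_id"]
        for p in doc.get("predication", [])
        if any(field not in p for field in NER_FIELDS)
    ]


# Hypothetical document: the first entry is a placeholder with all NER fields,
# the second uses a predication ID from the list above and has none.
sample_doc = {
    "_id": "C0871124-PROCESS_OF-C0008059",
    "predication": [
        {
            "predication_id": 1,  # placeholder ID, not real data
            "subject_text": "placeholder subject",
            "subject_score": 1000,
            "object_text": "placeholder object",
            "object_score": 1000,
        },
        {"predication_id": 201544459},
    ],
}

missing_ids = predications_missing_ner(sample_doc)
```
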
erikyao commented 1 year ago

@colleenXu @andrewsu semmeddb2 updated with NER stats in 4 new fields:

```python
{
    "_id": "...",
    "predication": [
        {
            "object_score": <int>,
            "object_text": <str>,
            "subject_score": <int>,
            "subject_text": <str>
        },
        # omitted
    ]
}
```

andrewsu commented 1 year ago

Super, this looks great, thanks!