add in NER stats from SemMedDB to the semmeddb2 API

andrewsu commented 1 year ago

Now that we've created the new https://biothings.ncats.io/semmeddb2 API as part of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/569 to investigate filtering strategies to improve signal/noise, let's also join in information about the Named Entity Recognition (NER) from the PREDICATION_AUX table (https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html):

I can really only imagine us using the SUBJECT_TEXT and SUBJECT_SCORE values (plus the corresponding OBJECT values), so let's focus on those. We can add these values to the predication object at the same level as the predication_id:

colleenXu commented 1 year ago

FYI: I also saw some possible predicate info and sentence-predication confidence info:

I don't see any "score" for the relationship selection ("predicate"), but there's also a SCORE column in the ENTITY table (that seems to relate to individual sentences).

There are columns in the PREDICATION_AUX table to identify the position of the string that was used to pick the predicate (PREDICATE_START_INDEX and PREDICATE_END_INDEX).

originally posted here

erikyao commented 1 year ago

Example#1: `Plasmids` STIMULATES `Dihydrofolate Reductase`


```python
{
    "_id": "C0032136-STIMULATES-C0039667",
    "predicate": "STIMULATES",
    "predication": [
        {
            "predication_id": 84692642,
            "pmid": 360038,
            "sentence_id": 43528009,
            "sentence": "The R factor induced enzyme was partially purified from a strain carrying a multicopy recombinant plasmid into which the 1770 bp fragment was inserted and which induced high levels of dihydrofolate reductase.",
            "subject_text": "plasmid",
            "subject_score": 773,
            "object_text": "dihydrofolate reductase",
            "object_score": 1000
        },
        {
            "predication_id": 102407545,
            "pmid": 6407900,
            "sentence_id": 74212980,
            "sentence": "A plasmid mutation has been identified that increases expression of mouse DHFR more than ten-fold.",
            "subject_text": "plasmid",
            "subject_score": 888,
            "object_text": "DHFR",
            "object_score": 824
        }
    ],
    "pmid_count": 2,
    "predication_count": 2,
    "subject": {
        "umls": "C0032136",
        "name": "Plasmids",
        "semantic_type_abbreviation": "bacs",
        "semantic_type_name": "Biologically Active Substance",
        "novelty": 1
    },
    "object": {
        "umls": "C0039667",
        "name": "Dihydrofolate Reductase",
        "semantic_type_abbreviation": "gngm",
        "semantic_type_name": "Gene or Genome",
        "novelty": 1
    }
}
```

Example#2: `CDK3 gene` INTERACTS_WITH `activating transcription factor 1`


```python
{
    "_id": "C1332734-INTERACTS_WITH-C0214635",
    "predicate": "INTERACTS_WITH",
    "predication": [
        {
            "predication_id": 125412171,
            "pmid": 18794154,
            "sentence_id": 120609124,
            "sentence": "Cyclin-dependent kinase 3-mediated activating transcription factor 1 phosphorylation enhances cell transformation.",
            "subject_text": "Cyclin-dependent kinase 3",
            "subject_score": 849,
            "object_text": "activating transcription factor 1",
            "object_score": 849
        },
        {
            "predication_id": 125412566,
            "pmid": 18794154,
            "sentence_id": 120609128,
            "sentence": "Furthermore, we found that cdk3 phosphorylates activating transcription factor 1 (ATF1) at serine 63 and enhances the transactivation and transcriptional activities of ATF1.",
            "subject_text": "cdk3",
            "subject_score": 1000,
            "object_text": "activating transcription factor 1",
            "object_score": 1000
        }
    ],
    "pmid_count": 1,
    "predication_count": 2,
    "subject": {
        "umls": "C1332734",
        "name": "CDK3 gene",
        "semantic_type_abbreviation": [
            "aapp",
            "gngm"
        ],
        "semantic_type_name": [
            "Amino Acid, Peptide, or Protein",
            "Gene or Genome"
        ],
        "novelty": 1
    },
    "object": {
        "umls": "C0214635",
        "name": "activating transcription factor 1",
        "semantic_type_abbreviation": "aapp",
        "semantic_type_name": "Amino Acid, Peptide, or Protein",
        "novelty": 1
    }
}
```

Example#3: `C1333570-CAUSES-C0023882`

Questionable NER data:

- Text `little` should not be connected to concept `Little's Disease`.
- Text `PSMs` (plant secondary metabolites) should not be connected to concept `FOLH1 gene`.
  - `FOLH1 gene` is also known as PSMA (Prostate-Specific Membrane Antigen), which could be the cause of the mistake.


```python
{
    "_id": "C1333570-CAUSES-C0023882",
    "predicate": "CAUSES",
    "predication": [
        {
            "predication_id": 182378913,
            "pmid": 31580494,
            "sentence_id": 342370606,
            "sentence": "Ambient temperature has been shown to alter liver function in rodents and the toxicity of some PSMs, but little is known about the physiological and nutritional consequences of consuming PSMs at different ambient temperatures.",
            "subject_text": "PSMs",
            "subject_score": 827,
            "object_text": "little",
            "object_score": 1000
        }
    ],
    "pmid_count": 1,
    "predication_count": 1,
    "subject": {
        "umls": "C1333570",
        "name": "FOLH1 gene",
        "semantic_type_abbreviation": "gngm",
        "semantic_type_name": "Gene or Genome",
        "novelty": 1
    },
    "object": {
        "umls": "C0023882",
        "name": "Little's Disease",
        "semantic_type_abbreviation": "dsyn",
        "semantic_type_name": "Disease or Syndrome",
        "novelty": 1
    }
}
```

erikyao commented 1 year ago

Summary statistics of all NER scores

| STAT | `subject_score` | `object_score` |
| --- | --- | --- |
| TOTAL | 122611719 | 122611719 |
| MIN | 0 | 0 |
| MAX | 1000 | 1000 |
| MEAN | 927.23 | 922.86 |
| MEDIAN | 916 | 901 |
| 2.5TH PERCENTILE | 766 | 759 |
| 25TH PERCENTILE | 888 | 872 |
| 50TH PERCENTILE | 916 | 901 |
| 75TH PERCENTILE | 1000 | 1000 |
| 97.5TH PERCENTILE | 1000 | 1000 |

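So NER reports high confidence in the connection between the entity texts and concepts. A threshold around 800 seems weak: the 25th percentiles are 888 and 872, so it would remove fewer than a quarter of the values for either score.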
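For anyone who wants to try a score cutoff against a document, here is a minimal sketch. It assumes the `subject_score` / `object_score` fields shown in the examples above. The helper name `filter_predications` and the choice to drop records that lack scores are mine for illustration, not part of the API.

```python
from typing import Any


def filter_predications(doc: dict[str, Any], min_score: int) -> list[dict[str, Any]]:
    """Keep predications whose subject and object NER scores both reach min_score."""
    kept = []
    for pred in doc.get("predication", []):
        subject_score = pred.get("subject_score")
        object_score = pred.get("object_score")
        # Assumption of this sketch: records without NER scores are dropped.
        if subject_score is None or object_score is None:
            continue
        if subject_score >= min_score and object_score >= min_score:
            kept.append(pred)
    return kept


# Trimmed copy of Example#1 (only the fields the filter needs).
doc = {
    "_id": "C0032136-STIMULATES-C0039667",
    "predication": [
        {"predication_id": 84692642, "subject_score": 773, "object_score": 1000},
        {"predication_id": 102407545, "subject_score": 888, "object_score": 824},
    ],
}

kept_ids = [p["predication_id"] for p in filter_predications(doc, min_score=800)]
print(kept_ids)  # [102407545]
```

Note that the questionable predication in Example#3 has scores of 827 and 1000, so it would pass a cutoff of 800.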

erikyao commented 1 year ago

Statistics of predication list lengths (i.e. `predication_count` values in existing documents)

| STAT | `predication_count` |
| --- | --- |
| TOTAL | 24481939 |
| MIN | 1 |
| MAX | 64451 |
| MEAN | 3.65 |
| MEDIAN | 1 |
| 2.5TH PERCENTILE | 1 |
| 25TH PERCENTILE | 1 |
| 50TH PERCENTILE | 1 |
| 75TH PERCENTILE | 2 |
| 97.5TH PERCENTILE | 15 |

The document with the maximum predication_count is C0023884-PART_OF-C0034693 (Liver PART_OF Rattus norvegicus), the same document that caused the BSONObjectTooLarge error in MongoDB.

If we apply a threshold of 1000 to the length of predication lists, 4,293 documents out of 24,481,939 (i.e. 0.0175%) will be affected.
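The percentage above can be checked directly from the two counts. The second half of the sketch below shows one possible reading of "applying a threshold", namely truncating the list. That reading and the helper name `truncate_predications` are assumptions on my part, not something decided in this thread.

```python
TOTAL_DOCS = 24_481_939   # TOTAL from the predication_count table
AFFECTED_DOCS = 4_293     # documents with more than 1000 predications

affected_pct = AFFECTED_DOCS / TOTAL_DOCS * 100
print(f"{affected_pct:.4f}%")  # 0.0175%


def truncate_predications(doc: dict, limit: int = 1000) -> dict:
    """Return a shallow copy of doc with its predication list cut to `limit` items.

    Assumption of this sketch: predication_count keeps its original value,
    so a truncated document stays detectable (count > len(predication)).
    """
    capped = dict(doc)
    capped["predication"] = doc["predication"][:limit]
    return capped
```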

erikyao commented 1 year ago

14 predication records, involving 10 semmeddb2 documents, have NO NER stats in the source data. They are:

| `predication_id` | doc `_id` |
| --- | --- |
| 201544459 | C0871124-PROCESS_OF-C0008059 |
| 203800327 | C0039185-STIMULATES-C2936529 |
| 201986825 | C1709820-PART_OF-C0029045 |
| 192912334 | N/A |
| 196519528 | N/A |
| 196519532 | N/A |
| 198555342 | C4321237-PROCESS_OF-C0008059 |
| 199170923 | C0429845-USES-C1709820 |
| 192883183 | C0020538-COEXISTS_WITH-C0871124 |
| 201986826 | C0018207-LOCATION_OF-6098 |
| 203209176 | C1422771-compared_with-C1418270 |
| 203209174 | C0206131-LOCATION_OF-C1418270 |
| 205429882 | N/A |
| 201469895 | C1817666-PROCESS_OF-C0024432 |

erikyao commented 1 year ago

@colleenXu @andrewsu semmeddb2 updated with NER stats in 4 new fields:

{
    "_id": "...",
    "predication": [
       {
            "object_score": <int>,
            "object_text": <str>,
            "subject_score": <int>,
            "subject_text": <str>
       },
       # omitted
    ]
}
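For readers curious how the four fields relate to the source table, here is a rough sketch of a join on the predication ID. It is not the actual semmeddb2 parser code. It assumes PREDICATION_AUX rows arrive as dicts keyed by the uppercase column names, and the helper name `attach_ner_stats` is hypothetical.

```python
# Mapping from PREDICATION_AUX columns to the new document fields.
AUX_TO_FIELD = {
    "SUBJECT_TEXT": "subject_text",
    "SUBJECT_SCORE": "subject_score",
    "OBJECT_TEXT": "object_text",
    "OBJECT_SCORE": "object_score",
}


def attach_ner_stats(predications: list[dict], aux_rows: list[dict]) -> list[dict]:
    """Copy the four NER fields onto each predication that has a matching aux row."""
    aux_by_id = {row["PREDICATION_ID"]: row for row in aux_rows}
    merged = []
    for pred in predications:
        enriched = dict(pred)
        aux = aux_by_id.get(pred["predication_id"])
        # Predications without an aux row are left without the new fields.
        if aux is not None:
            for column, field in AUX_TO_FIELD.items():
                enriched[field] = aux[column]
        merged.append(enriched)
    return merged


# Illustrative input: one predication with an aux row, one without.
predications = [{"predication_id": 84692642}, {"predication_id": 192912334}]
aux_rows = [
    {
        "PREDICATION_ID": 84692642,
        "SUBJECT_TEXT": "plasmid",
        "SUBJECT_SCORE": 773,
        "OBJECT_TEXT": "dihydrofolate reductase",
        "OBJECT_SCORE": 1000,
    }
]

merged = attach_ner_stats(predications, aux_rows)
print(merged[0]["subject_score"])      # 773
print("subject_score" in merged[1])    # False
```

Because a few predication records have no NER stats in the source data, consumers should probably treat the four fields as optional.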

andrewsu commented 1 year ago

Super, this looks great, thanks!
