Closed andrewsu closed 1 year ago
FYI: I also saw some possible predicate info and sentence-predication confidence info:
I don't see any "score" for the relationship selection ("predicate"), but there's also a `SCORE` column in the ENTITY table (that seems to relate to individual sentences). There are columns in the PREDICATION_AUX table to identify the position of the string that was used to pick the predicate (`PREDICATE_START_INDEX` and `PREDICATE_END_INDEX`).
originally posted here
Plasmids STIMULATES Dihydrofolate Reductase
CDK3 gene INTERACTS_WITH activating transcription factor 1
C1333570-CAUSES-C0023882
Questionable NER data:
`little` should not be connected to concept Little's Disease
`PSMs` (plant secondary metabolites) should not be connected to concept FOLH1 gene. FOLH1 gene is also known as `PSMA` (Prostate-Specific Membrane Antigen), and this could be the cause of the mistake.

STAT | subject_score | object_score |
---|---|---|
TOTAL | 122611719 | 122611719 |
MIN | 0 | 0 |
MAX | 1000 | 1000 |
MEAN | 927.23 | 922.86 |
MEDIAN | 916 | 901 |
2.5TH PERCENTILE | 766 | 759 |
25TH PERCENTILE | 888 | 872 |
50TH PERCENTILE | 916 | 901 |
75TH PERCENTILE | 1000 | 1000 |
97.5TH PERCENTILE | 1000 | 1000 |
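So NER shows high confidence in the connection between the entity texts and concepts. A threshold around 800 seems weak: with 25th percentiles of 888 and 872, it would remove fewer than a quarter of the records on either score.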
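A minimal sketch of what such a score threshold would do, assuming each entry in a document's `predication` list carries integer `subject_score` and `object_score` fields in the 0 to 1000 range shown in the table above. The function name `filter_by_ner_score`, the choice to drop records with a missing score, and the sample records are all hypothetical.

```python
from typing import Any


def filter_by_ner_score(
    predications: list[dict[str, Any]], threshold: int = 800
) -> list[dict[str, Any]]:
    """Keep predications whose subject AND object NER scores meet the threshold.

    Records missing either score are dropped (an assumption for this sketch).
    """
    kept = []
    for pred in predications:
        subject_score = pred.get("subject_score")
        object_score = pred.get("object_score")
        if subject_score is None or object_score is None:
            continue
        if subject_score >= threshold and object_score >= threshold:
            kept.append(pred)
    return kept


# Made-up sample records, not taken from semmeddb2.
sample = [
    {"predication_id": 1, "subject_score": 1000, "object_score": 1000},
    {"predication_id": 2, "subject_score": 766, "object_score": 901},
    {"predication_id": 3, "subject_score": 888, "object_score": 872},
    {"predication_id": 4, "object_score": 916},  # no subject_score
]
kept_ids = [p["predication_id"] for p in filter_by_ner_score(sample, threshold=800)]
print(kept_ids)  # [1, 3]
```

Requiring both scores to pass is only one option; filtering on the minimum, the mean, or a single side would behave differently and may be worth comparing.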
`predication_count` values in existing documents:

STAT | predication_count |
---|---|
TOTAL | 24481939 |
MIN | 1 |
MAX | 64451 |
MEAN | 3.65 |
MEDIAN | 1 |
2.5TH PERCENTILE | 1 |
25TH PERCENTILE | 1 |
50TH PERCENTILE | 1 |
75TH PERCENTILE | 2 |
97.5TH PERCENTILE | 15 |
The document with the max `predication_count` is exactly `C0023884-PART_OF-C0034693` (Liver PART_OF Rattus norvegicus), which caused the `BSONObjectTooLarge` error in MongoDB.
If we apply a threshold of 1000 to the length of predication lists, 4,293 documents out of 24,481,939 (i.e. 0.0175%) will be affected.
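A rough sketch of how a length cap on the predication list could be applied, together with a check of the percentage quoted above. The names `MAX_PREDICATIONS` and `cap_predications` are hypothetical, and keeping the first `limit` entries is an arbitrary choice for illustration; the thread does not say how affected documents would actually be handled.

```python
MAX_PREDICATIONS = 1000  # threshold discussed above


def cap_predications(doc: dict, limit: int = MAX_PREDICATIONS) -> tuple[dict, bool]:
    """Return a shallow copy of `doc` with at most `limit` predications,
    plus a flag telling whether anything was cut.

    Keeping the first `limit` entries is an assumption of this sketch.
    """
    preds = doc.get("predication", [])
    truncated = len(preds) > limit
    capped = dict(doc)
    capped["predication"] = preds[:limit]
    return capped, truncated


# Share of documents affected, using the counts reported above.
affected, total = 4_293, 24_481_939
share = f"{affected / total:.4%}"
print(share)  # 0.0175%

# Made-up document with three predications, capped at two.
toy_doc = {"_id": "example", "predication": [{"predication_id": i} for i in range(3)]}
capped_doc, was_truncated = cap_predications(toy_doc, limit=2)
print(len(capped_doc["predication"]), was_truncated)  # 2 True
```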
14 predication records, involving 10 `semmeddb2` documents, have NO NER stats from the source data. They are:

predication_id | doc _id |
---|---|
201544459 | C0871124-PROCESS_OF-C0008059 |
203800327 | C0039185-STIMULATES-C2936529 |
201986825 | C1709820-PART_OF-C0029045 |
192912334 | N/A |
196519528 | N/A |
196519532 | N/A |
198555342 | C4321237-PROCESS_OF-C0008059 |
199170923 | C0429845-USES-C1709820 |
192883183 | C0020538-COEXISTS_WITH-C0871124 |
201986826 | C0018207-LOCATION_OF-6098 |
203209176 | C1422771-compared_with-C1418270 |
203209174 | C0206131-LOCATION_OF-C1418270 |
205429882 | N/A |
201469895 | C1817666-PROCESS_OF-C0024432 |
@colleenXu @andrewsu `semmeddb2` updated with NER stats in 4 new fields:
{
    "_id": "...",
    "predication": [
        {
            "object_score": <int>,
            "object_text": <str>,
            "subject_score": <int>,
            "subject_text": <str>
        },
        # omitted
    ]
}
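For illustration, one way these four fields could be attached to each predication entry is sketched below. It assumes the PREDICATION_AUX rows have been loaded into a dict keyed by predication ID, with columns named `SUBJECT_TEXT`, `SUBJECT_SCORE`, `OBJECT_TEXT` and `OBJECT_SCORE`. The helper `attach_ner_fields`, the `aux_by_id` lookup and the sample values are hypothetical, not the actual parser code. Predications without a matching row (such as the 14 records listed earlier) get `None` for all four fields in this sketch.

```python
from typing import Any

# Source column name -> field name in the predication entry.
NER_FIELDS = {
    "SUBJECT_TEXT": "subject_text",
    "SUBJECT_SCORE": "subject_score",
    "OBJECT_TEXT": "object_text",
    "OBJECT_SCORE": "object_score",
}


def attach_ner_fields(
    doc: dict[str, Any], aux_by_id: dict[int, dict[str, Any]]
) -> dict[str, Any]:
    """Add the four NER fields to every predication entry of `doc`, in place.

    Entries with no matching auxiliary row get None for each field.
    """
    for pred in doc.get("predication", []):
        aux = aux_by_id.get(pred["predication_id"])
        for column, field in NER_FIELDS.items():
            pred[field] = aux.get(column) if aux is not None else None
    return doc


# Made-up example data.
doc = {
    "_id": "example-doc",
    "predication": [{"predication_id": 1}, {"predication_id": 2}],
}
aux_by_id = {
    1: {
        "SUBJECT_TEXT": "liver",
        "SUBJECT_SCORE": 1000,
        "OBJECT_TEXT": "rats",
        "OBJECT_SCORE": 901,
    }
}
attach_ner_fields(doc, aux_by_id)
print(doc["predication"][0]["subject_score"])  # 1000
print(doc["predication"][1]["object_text"])  # None
```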
Super, this looks great, thanks!
Now that we've created the new https://biothings.ncats.io/semmeddb2 API as part of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/569 to investigate filtering strategies to improve signal/noise, let's also join in information about the Named Entity Recognition (NER) from the PREDICATION_AUX table (https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html):
I can really only imagine us using the SUBJECT_TEXT and SUBJECT_SCORE values (plus the corresponding OBJECT values), so let's focus on those. We can add these values to the `predication` object at the same level as the `predication_id`: