NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

SemMedDB filtering based on novelty score #432

Open andrewsu opened 1 year ago

andrewsu commented 1 year ago

As a strategy to remove generic answers/nodes, it was suggested that filtering out SemMedDB records based on the novelty score could be useful. Novelty is (very briefly) described at https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html.

This issue tracks discussion on this topic.

andrewsu commented 1 year ago

To show example entities that would be excluded by a filter requiring novelty: 1, I took all ~10 million subjects in the latest predications file with novelty: 0 and printed the UMLS ID, the name, and the semantic type. I then sorted them by how many times each node appeared (counts in the first column; first 50 lines shown below):

2446984 "C0087111","Therapeutic procedure","topp"
1409781 "C0007634","Cells","cell"
 644966 "C0030705","Patients","humn"
 599740 "C0012634","Disease","dsyn"
 498028 "C0013227","Pharmaceutical Preparations","phsu"
 279206 "C0017337","Genes","gngm"
 248044 "C1457887","Symptoms","sosy"
 237167 "C0033684","Proteins","aapp"
 221809 "C1273869","Intervention regimes","hlca"
 196269 "C0009566","Complication","patf"
 152704 "C0184661","Procedures","hlca"
 139243 "C0039082","Syndrome","dsyn"
 138161 "C0936012","Analysis","resa"
 134892 "C0030705","Patients","podg"
 122048 "C0011900","Diagnosis","hlca"
 100171 "C0014442","Enzymes","enzy"
  99248 "C0035668","RNA","bacs"
  88265 "C0597198","Performance","inbe"
  85559 "C0277785","Functional disorder","patf"
  82701 "C0243192","agonists","phsu"
  66915 "C0003062","Animals","anim"
  66118 "C0184661","Interventional procedure","topp"
  62350 "C0031843","physiological aspects","phsf"
  58818 "C0020114","Human","humn"
  58288 "C0004927","Behavior","inbe"
  54838 "C0030956","Peptides","aapp"
  53147 "C1185740","Tract","bpoc"
  51306 "C0879626","Adverse effects","patf"
  47253 "C1257890","Population Group","humn"
  42038 "C0029235","Organism","orgm"
  41860 "C0311392","Physical findings","sosy"
  41563 "C0431085","[M]Unspecified tumor cell NOS","cell"
  38501 "C0017428","Genome","gngm"
  38141 "C0687732","Prevention","topp"
  37707 "C0031327","Drug Kinetics","phsf"
  32972 "C0011900","Diagnosis","diap"
  31817 "C0237401","Individual","humn"
  29861 "C0205148","Surface","spco"
  29216 "C0033684","Proteins","bacs"
  28453 "C0205147","Region","spco"
  27683 "C0877248","Adverse event","fndg"
  25412 "C0679670","network","popg"
  24998 "C1185625","Compartments","bsoj"
  24614 "C0001779","Age","orga"
  23612 "C0019932","Hormones","horm"
  23410 "C0020114","Human","grup"
  22836 "C0005839","Blood supply aspects","bpoc"
  22677 "C0035168","research","resa"
  22615 "C0486805","FRAGMENTS","bdsu"
  22307 "C0017337","Genes","aapp"

If useful, the full list of entities with novelty: 0 (sorted by count) is here: novelty0.txt

To my eye, we are safe removing predications involving entities like these...
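
For anyone who wants to reproduce a tally like the one above, here is a minimal sketch. It assumes the predications have been exported to a CSV with a header row; the column names (`SUBJECT_CUI`, `SUBJECT_NAME`, `SUBJECT_SEMTYPE`, `SUBJECT_NOVELTY`) are assumptions for illustration and may not match the actual file, and the sample rows (including `CUI_X`) are made up. It is not necessarily how the counts above were produced.

```python
import csv
import io
from collections import Counter


def count_novelty0_subjects(rows):
    """Tally (cui, name, semtype) for subjects whose novelty value is 0.

    `rows` is any iterable of dicts keyed by the (assumed) column names.
    """
    counts = Counter()
    for row in rows:
        if row["SUBJECT_NOVELTY"] == "0":
            key = (row["SUBJECT_CUI"], row["SUBJECT_NAME"], row["SUBJECT_SEMTYPE"])
            counts[key] += 1
    return counts


# Tiny made-up sample standing in for the real predications file.
sample = io.StringIO(
    "SUBJECT_CUI,SUBJECT_NAME,SUBJECT_SEMTYPE,SUBJECT_NOVELTY\n"
    "C0007634,Cells,cell,0\n"
    "C0030705,Patients,humn,0\n"
    "C0007634,Cells,cell,0\n"
    "CUI_X,Hypothetical specific drug,phsu,1\n"
)

# Most frequent novelty-0 subjects first, capped at 50 like the list above.
top = count_novelty0_subjects(csv.DictReader(sample)).most_common(50)
for (cui, name, semtype), n in top:
    print(f'{n:>8} "{cui}","{name}","{semtype}"')
```

Keying on the (CUI, name, semantic type) triple rather than the CUI alone keeps entries such as "Patients"/humn and "Patients"/podg separate, as in the list above.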

cbizon commented 1 year ago

Agreed. Would looking at this as a fraction of how often the entity occurs do anything? So maybe there aren't that many absolute instances of Chromosomes, but all of its instances are marked as novelty=0
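
A rough sketch of the fraction idea, assuming each occurrence of an entity can be reduced to a (CUI, novelty) pair. The function name, the `min_count` parameter, and the placeholder IDs (`CUI_RARE`, `CUI_MIXED`) are all hypothetical:

```python
from collections import Counter


def novelty0_fraction(observations, min_count=1):
    """Return {cui: fraction of its occurrences that have novelty == 0}.

    `observations` is an iterable of (cui, novelty) pairs, novelty being 0 or 1.
    Entities seen fewer than `min_count` times are left out.
    """
    total = Counter()
    zero = Counter()
    for cui, novelty in observations:
        total[cui] += 1
        if novelty == 0:
            zero[cui] += 1
    return {cui: zero[cui] / n for cui, n in total.items() if n >= min_count}


# Made-up observations: a rare entity that is always novelty 0,
# and a more frequent one that is novelty 0 only half the time.
observations = [
    ("CUI_RARE", 0),
    ("CUI_RARE", 0),
    ("CUI_MIXED", 0),
    ("CUI_MIXED", 1),
    ("CUI_MIXED", 1),
    ("CUI_MIXED", 0),
]

fractions = novelty0_fraction(observations)
# Sort so entities that are (almost) always novelty 0 come first.
for cui, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(cui, frac)
```

Ranking by this fraction would surface entities with few absolute instances that are nonetheless always marked novelty 0, which a sort by raw count would bury.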

saramsey commented 1 year ago

Yes, I think it could improve reasoning to leverage the SemMed novelty score. RTX-KG2pre already includes these novelty scores, so they are already part of our ETL of SemMedDB. The real work would be adapting ARAX to make use of them (I guess I'm assuming that it is intended to use the novelty scores for result ranking within an ARA; is that right?).

andrewsu commented 1 year ago

> I guess I'm assuming that it is intended to use the novelty scores for result ranking within an ARA; is that right?

Sorry for the slow reply (OOO). I think the intent is that predications for which either the subject or object has novelty=0 would not be used in any reasoning and would not be reported back to the ARS/UI. We've chosen to filter out those records during our ETL process for the semmeddb API...
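
The rule described above (drop a predication if either side has novelty=0) could look roughly like this in an ETL step. This is only a sketch, not the actual semmeddb API code; the field names `SUBJECT_NOVELTY` and `OBJECT_NOVELTY` and the placeholder IDs are assumptions:

```python
def keep_predication(row):
    """Keep a predication only if neither the subject nor the object has novelty 0."""
    return row["SUBJECT_NOVELTY"] != 0 and row["OBJECT_NOVELTY"] != 0


# Made-up records; CUI_DRUG and CUI_DISEASE are placeholder IDs.
predications = [
    {"SUBJECT_CUI": "CUI_DRUG", "SUBJECT_NOVELTY": 1,
     "OBJECT_CUI": "CUI_DISEASE", "OBJECT_NOVELTY": 1},
    {"SUBJECT_CUI": "C0087111", "SUBJECT_NOVELTY": 0,   # generic subject
     "OBJECT_CUI": "CUI_DISEASE", "OBJECT_NOVELTY": 1},
    {"SUBJECT_CUI": "CUI_DRUG", "SUBJECT_NOVELTY": 1,
     "OBJECT_CUI": "C0030705", "OBJECT_NOVELTY": 0},    # generic object
]

kept = [p for p in predications if keep_predication(p)]
print(len(kept), "of", len(predications), "predications kept")
```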

saramsey commented 1 year ago

At least in ARAX, it seems like we might be able to address this issue alongside the publication filtering (see RTXteam/RTX issue 2045).

sierra-moxon commented 8 months ago

From TAQA: we would need all the ingestors to let us know if they have novelty in their graph for SEMMED edges. This does not feel like a priority right now. Sierra: need a new label for putting this kind of ticket in a low-priority backlog.