NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

CI - Results reads "Pharmaceutical Preparations Treats Race" - which may be either/all a capitalization issue, node typing issue, or ARA issue #111

Status: Closed (sstemann closed 1 year ago)

sstemann commented 1 year ago

Steps:

  1. In UI - CI (which is connected to ARS CI), ran "What chemical upregulates [Rhobtb2b]". Results here: https://ui.ci.transltr.io/results?q=6c48f244-779d-4297-890e-2756f9a1152b

  2. Expand the first few results and you see drug/chemical entities [edge: treats] entity "Race" [entity type: macromolecular machine mixin].

  3. Click the predicate treats.

  4. Review results in the ARAX UI and determine that ARAGORN is the only tool responding: https://arax.ncats.io/?r=6c48f244-779d-4297-890e-2756f9a1152b

  5. View the first result. The entity looks like racE, NCBIGene:8623000, so maybe it's just a capitalization or node-type issue in the UI, but I think the UI is displaying the results as is, given that the first entry under "Categories" is macromolecular machine mixin.

  6. Next, look at the evidence on the predicate.

  7. None of the PubMed articles are looking at racE, but rather race.

  8. It's not clear how these publications are related to RHOBTB2B if they aren't relevant to racE the gene.

cbizon commented 1 year ago

Looks like this is a bad semmed edge. It's coming from RTX-KG2, so flagging @saramsey (though I think it's a broader semmed problem)

cbizon commented 1 year ago

I wonder if we shouldn't have a blacklist for certain terms in semmed which it thinks are genes when they are not (e.g. RACE). @andrewsu thoughts?

andrewsu commented 1 year ago

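Possible solutions:

* **Exclusion list**: I tend to lean away from exclusion lists for semmed because I think it is playing whack-a-mole in a field full of moles. Very tough to do this in a sustainable way.

* **"NOVELTY" filter**: In Service Provider's Semmeddb API, we've filtered based on the "Novelty score" to remove all triples involving very general terms like "Pharmaceutical Preparations" (i.e., no hits for https://biothings.ncats.io/semmeddb/query?q=subject.umls:C0013227%20OR%20object.umls:C0013227). In general, I think this has been a very good data cleaning step for us.

* **Filtering by PMID count**: We could filter semmeddb by the number of supporting PMIDs, as I think will be explored at the upcoming relay. I suspect lots of noise is from triples that are only supported by a single publication.

* **Filtering by NER score**: One option is to compare SemMedDB against PubTator (as suggested in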

Having said all that, I actually suspect something odd might (also?) be going on with RTX-KG2's processing of SemMedDB specifically. The result above shows NCBIGene:8623000 as the CURIE for racE, which resolves to the racE gene in Dictyostelium. I doubt that SemMedDB is actually asserting anything about this racE gene, because there doesn't seem to be a UMLS ID for this concept in either the Metathesaurus browser or Node Normalizer. Looking at the SemMedDB results for one of the cited publications (PMID:10319190), SemMedDB does have an assertion about "Racial group" (C0034510), but not about any gene or protein (aside from "Cytochrome P450"):

"105097643","78576002","10319190","ISA","C0010762","Cytochrome P450","aapp","1","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097747","78576004","10319190","USES","C0087111","Therapeutic procedure","topp","0","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097776","78576004","10319190","TREATS","C0013227","Pharmaceutical Preparations","phsu","0","C0034510","Racial group","popg","1",\N,\N,\N
"105097864","78576006","10319190","AFFECTS","C0013227","Pharmaceutical Preparations","phsu","0","C0243102","enzyme activity","moft","1",\N,\N,\N
"105097903","78576006","10319190","AFFECTS","C0010762","Cytochrome P450","aapp","1","C0025519","Metabolism","orgf","1",\N,\N,\N
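
To make that check concrete, here is a minimal Python sketch that parses the five rows above and lists which concepts carry a gene or protein semantic type. The column names are an assumption (the thread only shows raw values, so they are labeled by position), and `gngm` / `aapp` are the UMLS semantic type abbreviations for "Gene or Genome" and "Amino Acid, Peptide, or Protein".

```python
import csv
import io

# Rows copied from the comment above (PMID 10319190); \N is the dump's NULL marker.
RAW = r'''
"105097643","78576002","10319190","ISA","C0010762","Cytochrome P450","aapp","1","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097747","78576004","10319190","USES","C0087111","Therapeutic procedure","topp","0","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097776","78576004","10319190","TREATS","C0013227","Pharmaceutical Preparations","phsu","0","C0034510","Racial group","popg","1",\N,\N,\N
"105097864","78576006","10319190","AFFECTS","C0013227","Pharmaceutical Preparations","phsu","0","C0243102","enzyme activity","moft","1",\N,\N,\N
"105097903","78576006","10319190","AFFECTS","C0010762","Cytochrome P450","aapp","1","C0025519","Metabolism","orgf","1",\N,\N,\N
'''

# ASSUMPTION: positional column labels; the thread does not name the columns.
COLUMNS = [
    "PREDICATION_ID", "SENTENCE_ID", "PMID", "PREDICATE",
    "SUBJECT_CUI", "SUBJECT_NAME", "SUBJECT_SEMTYPE", "SUBJECT_NOVELTY",
    "OBJECT_CUI", "OBJECT_NAME", "OBJECT_SEMTYPE", "OBJECT_NOVELTY",
    "EXTRA_1", "EXTRA_2", "EXTRA_3",
]

GENE_OR_PROTEIN = {"gngm", "aapp"}  # UMLS semantic type abbreviations


def parse(raw):
    """Turn the raw CSV text into a list of dicts, skipping blank lines."""
    reader = csv.reader(io.StringIO(raw))
    return [dict(zip(COLUMNS, row)) for row in reader if row]


rows = parse(RAW)

# Every subject/object name whose semantic type is gene- or protein-like.
gene_like = {
    row[f"{side}_NAME"]
    for row in rows
    for side in ("SUBJECT", "OBJECT")
    if row[f"{side}_SEMTYPE"] in GENE_OR_PROTEIN
}

# The TREATS assertion and the type of its object.
treats = [row for row in rows if row["PREDICATE"] == "TREATS"]

print(gene_like)
for row in treats:
    print(row["SUBJECT_NAME"], "TREATS", row["OBJECT_NAME"], row["OBJECT_SEMTYPE"])
```

In these rows the only gene- or protein-typed concept is "Cytochrome P450", and the object of the TREATS assertion is typed `popg` (population group), not a gene.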

saramsey commented 1 year ago

Thanks, the RTX-KG2 team will look into this.

saramsey commented 1 year ago

We will track this as issue 257 in RTX-KG2. When we have traced the root cause of the issue in RTX-KG2, we'll report back here as well.

saramsey commented 1 year ago

We've looked into this issue (thank you @cbizon for bringing RTX-KG2's role in this issue to our attention) and yes, we believe the odd result reported by @sstemann is occurring because of an incorrect conflation of two concepts in RTX-KG2c, namely "race" (the population concept) and "racE", the Rho GTPase gene in the slime mold Dictyostelium discoideum. If those two concepts had not been conflated in RTX-KG2c, we do not think this would have appeared as a result in Translator, because there would not have been a connected two-hop path between "Pharmaceutical Preparations" and "Rhobtb2b". We will fix this in a forthcoming release of RTX-KG2.
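
The two-hop argument can be illustrated with a toy graph. This is only a sketch, not RTX-KG2 code: the edge from racE to Rhobtb2b and its predicate are hypothetical stand-ins (the thread does not say which edge connects them), and `Rhobtb2b` is used as a placeholder identifier because no CURIE is given.

```python
from collections import defaultdict

DRUGS = "UMLS:C0013227"            # "Pharmaceutical Preparations"
RACE_POPULATION = "UMLS:C0034510"  # "Racial group" (population concept)
RACE_GENE = "NCBIGene:8623000"     # racE (Dictyostelium gene)
TARGET = "Rhobtb2b"                # placeholder; the thread gives no CURIE

EDGES = [
    (DRUGS, "treats", RACE_POPULATION),  # the SemMedDB TREATS assertion
    (RACE_GENE, "related_to", TARGET),   # HYPOTHETICAL edge and predicate
]


def two_hop_paths(edges, start, end, merged=None):
    """Return (start, middle, end) paths after applying a node-merge map."""
    merged = merged or {}

    def canon(node):
        # Map a node to its canonical ID if it has been merged into another.
        return merged.get(node, node)

    adjacency = defaultdict(set)
    for subject, _predicate, obj in edges:
        adjacency[canon(subject)].add(canon(obj))

    start, end = canon(start), canon(end)
    middles = sorted(adjacency.get(start, set()))
    return [(start, mid, end) for mid in middles if end in adjacency.get(mid, set())]


# Concepts kept separate: no path.
separate = two_hop_paths(EDGES, DRUGS, TARGET)

# Population concept wrongly merged into the gene: a path appears.
conflated = two_hop_paths(EDGES, DRUGS, TARGET, merged={RACE_POPULATION: RACE_GENE})

print(separate)
print(conflated)
```

With the two concepts kept apart, the drug node and the gene node share no intermediate; merging them is what creates the connecting path.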

We note, in passing, that Rhobtb2b does not appear to be a valid human gene symbol. It appears to be a zebrafish (Danio rerio) gene symbol. Maybe the intended human gene was RHOBTB2?

saramsey commented 1 year ago

> Possible solutions:
>
> * **Exclusion list**: I tend to lean away from exclusion lists for semmed because I think it is playing whack-a-mole in a field full of moles. Very tough to do this in a sustainable way.
>
> * **"NOVELTY" filter**: In Service Provider's Semmeddb API, we've filtered based on the "Novelty score" to remove all triples involving very general terms like "Pharmaceutical Preparations" (i.e., no hits for https://biothings.ncats.io/semmeddb/query?q=subject.umls:C0013227%20OR%20object.umls:C0013227). In general, I think this has been a very good data cleaning step for us.
>
> * **Filtering by PMID count**: We could filter semmeddb by the number of supporting PMIDs, as I think will be explored at the upcoming relay. I suspect lots of noise is from triples that are only supported by a single publication.
>
> * **Filtering by NER score**: One option is to compare SemMedDB against PubTator (as suggested in

Thank you, @andrewsu. Great ideas. Maybe could use some (or all) of these for ranking results?

andrewsu commented 1 year ago

> Great ideas. Maybe could use some (or all) of these for ranking results?

Yes, good idea, I think those features could absolutely be incorporated into a ranking/scoring scheme. On the BTE side, we're starting with the poor man's approach of removing records that don't pass reasonable filters (though clearly "reasonable" is subjective and in some cases TBD)...
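
As a rough illustration of that filter-then-drop approach, the sketch below combines the novelty and PMID-count ideas from the list of possible solutions. It is not the BTE or Service Provider implementation: the `Triple` class, the `passes_filters` helper, and the `min_pmids=2` threshold are all assumptions made for the example, and the third record is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical, simplified view of one SemMedDB-style assertion."""
    subject: str
    predicate: str
    obj: str
    subject_novel: bool  # False for very general terms
    object_novel: bool
    pmids: frozenset     # supporting publications


def passes_filters(triple, min_pmids=2):
    """Keep a triple only if both ends are novel and it has enough support.

    ASSUMPTION: min_pmids=2 is an arbitrary illustrative threshold.
    """
    return (
        triple.subject_novel
        and triple.object_novel
        and len(triple.pmids) >= min_pmids
    )


triples = [
    # General subject term, single supporting publication.
    Triple("Pharmaceutical Preparations", "TREATS", "Racial group",
           False, True, frozenset({"10319190"})),
    # Both ends novel, but only one supporting publication here.
    Triple("Cytochrome P450", "AFFECTS", "Metabolism",
           True, True, frozenset({"10319190"})),
    # Made-up record with broad support, for contrast.
    Triple("Drug X", "TREATS", "Disease Y",
           True, True, frozenset({"PMID:1", "PMID:2", "PMID:3"})),
]

kept = [t for t in triples if passes_filters(t)]
for t in kept:
    print(t.subject, t.predicate, t.obj, len(t.pmids))
```

The same two signals could feed a score instead of a hard cutoff, for example by returning a number from `passes_filters` rather than a boolean, which is closer to the ranking idea raised above.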

amykglen commented 1 year ago

the conflation of the protein racE with the term "Race" has been fixed in RTX-KG2 and ARAX dev/CI instances: https://arax.ncats.io/test/?term=NCBIGene:8623000

so I think this issue can be closed? (@sstemann)