Closed sstemann closed 1 year ago
Looks like this is a bad semmed edge. It's coming from RTX-KG2, so flagging @saramsey (though I think it's a broader semmed problem)
I wonder whether we should have a blacklist for certain terms that semmed treats as genes when they are not (e.g. RACE). @andrewsu thoughts?
Having said all that, I actually suspect something odd might (also?) be going on with RTX-KG2's processing of semmeddb specifically. The result above shows NCBIGene:8623000 as the CURIE for racE, which resolves to the racE gene in Dictyostelium. I doubt that SemMedDB is actually asserting anything about this racE gene, because there doesn't seem to be a UMLS ID for this concept in either the Metathesaurus browser or Node Normalizer. Taking a look at the semmed results for one of the cited publications (PMID:10319190), SemMedDB does have an assertion about "Racial group" (C0034510), but not about any gene or protein (aside from "Cytochrome P450").
"105097643","78576002","10319190","ISA","C0010762","Cytochrome P450","aapp","1","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097747","78576004","10319190","USES","C0087111","Therapeutic procedure","topp","0","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097776","78576004","10319190","TREATS","C0013227","Pharmaceutical Preparations","phsu","0","C0034510","Racial group","popg","1",\N,\N,\N
"105097864","78576006","10319190","AFFECTS","C0013227","Pharmaceutical Preparations","phsu","0","C0243102","enzyme activity","moft","1",\N,\N,\N
"105097903","78576006","10319190","AFFECTS","C0010762","Cytochrome P450","aapp","1","C0025519","Metabolism","orgf","1",\N,\N,\N
Thanks, the RTX-KG2 team will look into this.
We will track this as issue 257 in RTX-KG2. When we have traced the root cause of the issue in RTX-KG2, we'll report back here as well.
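For reference, the check on the five SemMedDB rows quoted earlier in this thread can be sketched in code. This is a minimal sketch, not the RTX-KG2 pipeline. The column names are an assumption based on the SemMedDB PREDICATION table layout (the quoted rows carry no header), and `\N` is assumed to be the export's NULL marker.

```python
import csv
import io

# Assumed column names (SemMedDB PREDICATION table layout); the rows quoted
# in the thread have no header, so treat these labels as illustrative.
COLUMNS = [
    "predication_id", "sentence_id", "pmid", "predicate",
    "subject_cui", "subject_name", "subject_semtype", "subject_novelty",
    "object_cui", "object_name", "object_semtype", "object_novelty",
    "fact_value", "mod_scale", "mod_value",
]

# The five rows for PMID:10319190, copied from the thread (raw string keeps \N intact).
ROWS = r'''"105097643","78576002","10319190","ISA","C0010762","Cytochrome P450","aapp","1","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097747","78576004","10319190","USES","C0087111","Therapeutic procedure","topp","0","C0013227","Pharmaceutical Preparations","phsu","0",\N,\N,\N
"105097776","78576004","10319190","TREATS","C0013227","Pharmaceutical Preparations","phsu","0","C0034510","Racial group","popg","1",\N,\N,\N
"105097864","78576006","10319190","AFFECTS","C0013227","Pharmaceutical Preparations","phsu","0","C0243102","enzyme activity","moft","1",\N,\N,\N
"105097903","78576006","10319190","AFFECTS","C0010762","Cytochrome P450","aapp","1","C0025519","Metabolism","orgf","1",\N,\N,\N
'''

# UMLS semantic type abbreviations: gngm = Gene or Genome,
# aapp = Amino Acid, Peptide, or Protein.
GENE_LIKE_SEMTYPES = {"gngm", "aapp"}


def parse(text):
    """Parse headerless CSV rows into dicts keyed by the assumed column names."""
    return [dict(zip(COLUMNS, row)) for row in csv.reader(io.StringIO(text))]


records = parse(ROWS)

# Rows that mention "Racial group" (C0034510) on either side.
race_rows = [
    r for r in records
    if "C0034510" in (r["subject_cui"], r["object_cui"])
]

# Names of every gene- or protein-typed concept in these rows.
gene_like = sorted({
    r[f"{side}_name"]
    for r in records
    for side in ("subject", "object")
    if r[f"{side}_semtype"] in GENE_LIKE_SEMTYPES
})

print(race_rows[0]["predicate"], race_rows[0]["object_semtype"])
print(gene_like)
```

On these rows, the only concept typed as a gene or protein is "Cytochrome P450", and "Racial group" appears with the population-group semantic type, which matches the observation that SemMedDB asserts nothing about a racE gene for this publication.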
We've looked into this issue (thank you @cbizon for bringing the RTX-KG2 role in this issue to our attention) and yes, we believe the odd result reported by @sstemann is occurring because of an incorrect conflation of two concepts in RTX-KG2c, namely "race" (the population concept) and "racE", the Rho GTPase gene in the slime mold Dictyostelium discoideum. If those two concepts had not been conflated in RTX-KG2c, then we do not think this would have appeared as a result in Translator, because there would not have been a connected two-hop path between "Pharmaceutical preparation" and "Rhobtb2b". We will fix this in a forthcoming release of RTX-KG2.
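To illustrate why the conflation matters for path finding, here is a toy sketch. It does not reproduce how RTX-KG2c merges concepts (the thread does not say how the conflation arose). The edge from racE to Rhobtb2b and the bare `Rhobtb2b` node label are hypothetical placeholders, and `two_hop_paths` is an illustrative helper, not an RTX-KG2 function.

```python
from collections import defaultdict

# Toy undirected edges. The first comes from the SemMedDB assertion
# "Pharmaceutical Preparations TREATS Racial group"; the second is a
# hypothetical placeholder for whatever links racE to Rhobtb2b.
EDGES = [
    ("UMLS:C0013227", "UMLS:C0034510"),   # Pharmaceutical Preparations - Racial group
    ("NCBIGene:8623000", "Rhobtb2b"),     # racE - Rhobtb2b (hypothetical edge)
]


def two_hop_paths(edges, canonical, start, end):
    """Return the middle nodes m with start - m - end, after mapping node IDs
    through `canonical` (a dict of node ID -> merged node ID)."""
    def norm(node):
        return canonical.get(node, node)

    neighbors = defaultdict(set)
    for a, b in edges:
        a, b = norm(a), norm(b)
        neighbors[a].add(b)
        neighbors[b].add(a)
    return sorted(neighbors[norm(start)] & neighbors[norm(end)])


# Without conflation, the two concepts stay separate nodes.
before = two_hop_paths(EDGES, {}, "UMLS:C0013227", "Rhobtb2b")

# With "race" and "racE" merged into one node, a two-hop path appears.
merged = {"NCBIGene:8623000": "UMLS:C0034510"}
after = two_hop_paths(EDGES, merged, "UMLS:C0013227", "Rhobtb2b")

print(before)
print(after)
```

With the two concepts kept apart there is no shared neighbor, so no two-hop path exists. Once they share an identifier, the merged node becomes the bridge between the drug concept and the gene.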
We note, in passing, that Rhobtb2b does not appear to be a valid human gene symbol. It appears to be a zebrafish (Danio rerio) gene symbol. Maybe the intended human gene was RHOBTB2?
Possible solutions:
* **Exclusion list**: I tend to lean away from exclusion lists for semmed because I think it is playing whack-a-mole in a field full of moles. It is very tough to do this in a sustainable way.
* **"NOVELTY" filter**: In Service Provider's Semmeddb API, we've filtered on the "Novelty score" to remove all triples involving very general terms like "Pharmaceutical Preparations" (i.e., no hits for https://biothings.ncats.io/semmeddb/query?q=subject.umls:C0013227%20OR%20object.umls:C0013227). In general, I think this has been a very good data cleaning step for us.
* **Filtering by PMID count**: We could filter semmeddb by the number of supporting PMIDs, as I think will be explored at the upcoming relay. I suspect lots of noise is from triples that are only supported by a single publication.
* **Filtering by NER score**: One option is to compare SemMedDB against PubTator (as suggested in
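The novelty and PMID-count filters from the list above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the Service Provider implementation: the field names are illustrative, a novelty value of `"0"` is assumed to mark a general term (as in the rows quoted earlier in this thread), and `min_pmids=2` is an arbitrary threshold. The second PMID and the last record in the sample data are made up for the demonstration.

```python
from collections import defaultdict


def filter_triples(records, min_pmids=2):
    """Keep (subject, predicate, object) triples whose subject and object are
    both flagged as novel and that are supported by at least `min_pmids`
    distinct PMIDs. Returns a dict of triple -> sorted list of PMIDs."""
    pmids = defaultdict(set)
    for r in records:
        # Novelty filter: skip triples that involve a general term.
        if r["subject_novelty"] == "0" or r["object_novelty"] == "0":
            continue
        key = (r["subject_cui"], r["predicate"], r["object_cui"])
        pmids[key].add(r["pmid"])
    # PMID-count filter: require enough independent supporting publications.
    return {k: sorted(v) for k, v in pmids.items() if len(v) >= min_pmids}


sample = [
    # From the thread: general term on the subject side, dropped by novelty.
    {"pmid": "10319190", "predicate": "TREATS",
     "subject_cui": "C0013227", "subject_novelty": "0",
     "object_cui": "C0034510", "object_novelty": "1"},
    # From the thread: both sides novel.
    {"pmid": "10319190", "predicate": "AFFECTS",
     "subject_cui": "C0010762", "subject_novelty": "1",
     "object_cui": "C0025519", "object_novelty": "1"},
    # Made-up second publication supporting the same triple.
    {"pmid": "00000001", "predicate": "AFFECTS",
     "subject_cui": "C0010762", "subject_novelty": "1",
     "object_cui": "C0025519", "object_novelty": "1"},
    # Made-up triple supported by a single publication, dropped by PMID count.
    {"pmid": "00000002", "predicate": "ISA",
     "subject_cui": "C0010762", "subject_novelty": "1",
     "object_cui": "C0243102", "object_novelty": "1"},
]

kept = filter_triples(sample)
print(kept)
```

The same per-triple PMID counts could feed a ranking score instead of a hard cutoff, if dropping records outright turns out to be too aggressive.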
Thank you, @andrewsu. Great ideas. Maybe we could use some (or all) of these for ranking results?
Yes, good idea, I think those features could absolutely be incorporated into a ranking/scoring scheme. On the BTE side, we're starting with the poor man's approach of removing records that don't pass reasonable filters (though clearly "reasonable" is subjective and in some cases TBD)...
The conflation of the protein racE with the term "Race" has been fixed in the RTX-KG2 and ARAX dev/CI instances: https://arax.ncats.io/test/?term=NCBIGene:8623000
So I think this issue can be closed? (@sstemann)
Steps:
1. In UI CI (which is connected to ARS CI), ran "What chemical upregulates [Rhobtb2b]". Results here: https://ui.ci.transltr.io/results?q=6c48f244-779d-4297-890e-2756f9a1152b
2. Expand the first few results and you see drug/chemical entities [edge: treats] entity "Race" [entity type macromolecular machine mixin].
3. Click the predicate "treats".
4. None of the PubMed articles are about racE; they are about race.

It is not clear how these publications are related to RHOBTB2B if they aren't relevant to racE the gene.