RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

normalization issue related to "coenzyme A" #2058

Closed saramsey closed 1 year ago

saramsey commented 1 year ago

In KG2.8.3c, we have two separate nodes that I think both correspond to "coenzyme A" and should be normalized together:

for more details and evidence, see https://github.com/RTXteam/RTX-KG2/issues/282

saramsey commented 1 year ago

Seems like we could detect that the first node needs to be coalesced to the second node based on comparing the name field of the second node to the all_names field of the first node, right?
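A minimal sketch of that check, assuming each node is available as a plain dict with a `name` string and an `all_names` list (the field names come from the comment above; the dict layout, the helper names, and the sample nodes are hypothetical and not the NodeSynonymizer's actual code):

```python
def normalize_name(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.casefold().split())


def should_coalesce(first_node: dict, second_node: dict) -> bool:
    """Return True if the second node's name appears among the first node's all_names."""
    target = normalize_name(second_node["name"])
    return any(normalize_name(n) == target for n in first_node.get("all_names", []))


# Hypothetical nodes for illustration only; these are not real KG2 records.
node_a = {"id": "EXAMPLE:1", "name": "CoA", "all_names": ["CoA", "Coenzyme A", "CoASH"]}
node_b = {"id": "EXAMPLE:2", "name": "coenzyme A", "all_names": ["coenzyme A"]}

print(should_coalesce(node_a, node_b))
```

A name match like this is only a hint that two nodes might be the same concept; it says nothing about whether their structures are actually identical.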

amykglen commented 1 year ago

thanks Steve. the SRI Node Normalizer has these as two separate clusters, and we now align with the SRI wherever possible, so that's why we have two clusters.

https://nodenorm.transltr.io/1.3/get_normalized_nodes?curie=PUBCHEM.COMPOUND:6816
https://nodenorm.transltr.io/1.3/get_normalized_nodes?curie=PUBCHEM.COMPOUND:87642
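For anyone wanting to script this comparison, here is a rough sketch. The endpoint and the `curie` parameter are taken from the URLs above. The response layout (a dict keyed by CURIE whose entry has an `id` object with an `identifier` field) is an assumption about the service's JSON, and the helper names and the stand-in responses are made up for illustration:

```python
from typing import Optional
from urllib.parse import urlencode

NODENORM_URL = "https://nodenorm.transltr.io/1.3/get_normalized_nodes"


def nodenorm_query_url(curie: str) -> str:
    """Build the lookup URL for one CURIE (keeps ':' unescaped, as in the links above)."""
    return f"{NODENORM_URL}?{urlencode({'curie': curie}, safe=':')}"


def preferred_id(response: dict, curie: str) -> Optional[str]:
    """Pull the cluster's preferred identifier out of a parsed response.

    Assumes the shape {curie: {"id": {"identifier": ...}, ...}}; returns None
    if the CURIE is missing or was not recognized.
    """
    entry = response.get(curie)
    if not entry:
        return None
    return entry.get("id", {}).get("identifier")


def same_cluster(resp_a: dict, curie_a: str, resp_b: dict, curie_b: str) -> bool:
    """Two CURIEs share a cluster if both resolve to the same preferred identifier."""
    id_a = preferred_id(resp_a, curie_a)
    id_b = preferred_id(resp_b, curie_b)
    return id_a is not None and id_a == id_b


# Made-up stand-in responses, NOT real NodeNorm output.
fake_a = {"EXAMPLE:1": {"id": {"identifier": "EXAMPLE:1"}}}
fake_b = {"EXAMPLE:2": {"id": {"identifier": "EXAMPLE:2"}}}

print(nodenorm_query_url("PUBCHEM.COMPOUND:6816"))
print(same_cluster(fake_a, "EXAMPLE:1", fake_b, "EXAMPLE:2"))
```

The sketch does no network I/O; the JSON fetched from each URL would be parsed and passed in as `resp_a` and `resp_b`.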

though I agree this seems like a clear merge miss. maybe we should report it to the SRI: https://github.com/TranslatorSRI/NodeNormalization/issues

I want to play with canonicalizing KG2 without strictly adhering to the SRI's clustering, but it seems like that probably shouldn't be the KG2c we use for Translator, due to how central the SRI Node Normalizer has become...

by the way, the cluster graphs in the ARAX UI are also useful (they include provenance - i.e., SRI vs. KG2 vs. name similarity edges):
https://arax.ncats.io/test/?term=PUBCHEM.COMPOUND:6816
https://arax.ncats.io/test/?term=PUBCHEM.COMPOUND:87642

saramsey commented 1 year ago

@amykglen yes could you please report this issue to the SRI? Thanks.

amykglen commented 1 year ago

sure, done! https://github.com/TranslatorSRI/NodeNormalization/issues/203

can we close this issue since the ARAX NodeSynonymizer is technically behaving how we intended?

amykglen commented 1 year ago

so it turns out those two Coenzyme A clusters differ in chirality; they have different InChIKeys - see Chris B.'s response here: https://github.com/TranslatorSRI/NodeNormalization/issues/203#issuecomment-1585824197
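As a rough illustration of how such a pair can be spotted: a standard InChIKey has three hyphen-separated blocks of 14, 10, and 1 characters, and the first block hashes only the connectivity (the molecular skeleton), while stereochemistry falls into the second block. Two keys that share the first block but differ afterwards are therefore consistent with stereoisomers, though other layers such as isotopes can cause the same pattern. The helper names below are hypothetical and the keys are placeholders, not the real Coenzyme A keys:

```python
from typing import Tuple


def split_inchikey(key: str) -> Tuple[str, str, str]:
    """Split an InChIKey into its (connectivity, remaining-layers, protonation) blocks."""
    parts = key.strip().upper().split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts[0], parts[1], parts[2]


def same_skeleton_different_key(key_a: str, key_b: str) -> bool:
    """True if both keys share the connectivity block but are not identical overall."""
    blocks_a = split_inchikey(key_a)
    blocks_b = split_inchikey(key_b)
    return blocks_a[0] == blocks_b[0] and blocks_a != blocks_b


# Placeholder keys for illustration only; they do not belong to any real compound.
key_1 = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
key_2 = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"

print(same_skeleton_different_key(key_1, key_2))
```

A check like this could flag name-based merges that deserve a second look before a node is attached to one cluster or the other.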

it looks like there are a total of 3 nodes in KG2pre that were tacked onto one of these two clusters based on name only (they weren't recognized by the SRI Node Normalizer). so unless the naming tendencies differ for those two chiral molecules, the Node Synonymizer may well have gotten their assignments wrong.