Closed saramsey closed 1 year ago
Seems like we could detect that the first node needs to be coalesced to the second node based on comparing the `name` field of the second node to the `all_names` field of the first node, right?
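A minimal sketch of that comparison, for illustration only. The function names and the dict layout are hypothetical and are not taken from the ARAX NodeSynonymizer code; the field names `name` and `all_names` and the example values come from the issue description. A match here is only a candidate signal for merging, not proof that the two nodes are the same chemical entity.

```python
def normalize_name(name: str) -> str:
    """Collapse whitespace and ignore case so 'Coenzyme A' equals 'coenzyme A'."""
    return " ".join(name.split()).casefold()


def name_matches_all_names(candidate: dict, target: dict) -> bool:
    """Return True if candidate's `name` appears in target's `all_names`.

    Both arguments are plain dicts standing in for KG2c nodes (an assumption
    made for this sketch).
    """
    candidate_name = candidate.get("name")
    if not candidate_name:
        return False
    target_names = {normalize_name(n) for n in target.get("all_names", [])}
    return normalize_name(candidate_name) in target_names


# Values copied from the two nodes in the issue description.
first_node = {"id": "PUBCHEM.COMPOUND:6816", "all_names": ["Coenzyme A"]}
second_node = {"id": "PUBCHEM.COMPOUND:87642", "name": "coenzyme A"}

print(name_matches_all_names(second_node, first_node))
```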
thanks Steve. the SRI Node Normalizer has these as two separate clusters, and we now align with the SRI wherever possible, so that's why we have two clusters.
https://nodenorm.transltr.io/1.3/get_normalized_nodes?curie=PUBCHEM.COMPOUND:6816 https://nodenorm.transltr.io/1.3/get_normalized_nodes?curie=PUBCHEM.COMPOUND:87642
though I agree this seems like a clear merge miss. maybe we should report it to the SRI: https://github.com/TranslatorSRI/NodeNormalization/issues
I want to play with canonicalizing KG2 without strictly adhering to the SRI's clustering, but it seems like that probably shouldn't be the KG2c we use for Translator, due to how central the SRI Node Normalizer has become...
by the way, the cluster graphs in the ARAX UI are also useful (they include provenance - i.e., SRI vs. KG2 vs. name similarity edges): https://arax.ncats.io/test/?term=PUBCHEM.COMPOUND:6816 https://arax.ncats.io/test/?term=PUBCHEM.COMPOUND:87642
@amykglen yes could you please report this issue to the SRI? Thanks.
sure, done! https://github.com/TranslatorSRI/NodeNormalization/issues/203
can we close this issue since the ARAX NodeSynonymizer is technically behaving how we intended?
so it turns out those two Coenzyme A clusters differ in chirality; they have different inchi keys - see Chris B.'s response here: https://github.com/TranslatorSRI/NodeNormalization/issues/203#issuecomment-1585824197
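For anyone who wants to check this kind of case quickly, here is a rough sketch that compares two standard InChIKeys block by block. It relies on the standard InChIKey layout, where the first 14-character block hashes the connectivity and the second block covers the remaining layers (stereochemistry among them). The function name and the returned labels are made up for this example; the two keys are the ones listed in `equivalent_curies` for the two nodes.

```python
def compare_inchikeys(key_a: str, key_b: str) -> str:
    """Classify how two standard InChIKeys relate to each other.

    Returns one of three illustrative labels:
      'identical'          - the full keys match
      'same-connectivity'  - first block matches, later blocks differ
                             (e.g. stereoisomers or isotopologues)
      'different'          - the connectivity blocks differ
    """
    if key_a == key_b:
        return "identical"
    # The first hyphen-separated block is the 14-character connectivity hash.
    connectivity_a = key_a.split("-", 1)[0]
    connectivity_b = key_b.split("-", 1)[0]
    if connectivity_a == connectivity_b:
        return "same-connectivity"
    return "different"


# InChIKeys from the two clusters (CURIE prefix 'INCHIKEY:' removed).
key_6816 = "RGJOEKWQDUBAIZ-DRCCLKDXSA-N"
key_87642 = "RGJOEKWQDUBAIZ-IBOSZNHHSA-N"

print(compare_inchikeys(key_6816, key_87642))
```

A 'same-connectivity' result means the two keys share a skeleton but differ in some later layer, so a name match alone would not be enough to justify merging them.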
it looks like a total of 3 nodes in KG2pre were tacked onto one of these two clusters based on name only (they weren't recognized by the SRI NodeNormalizer). so unless the naming tendencies differ for those two chiral molecules, the Node Synonymizer may well have gotten their assignments wrong.
In KG2.8.3c, we have two separate nodes that I think both correspond to "coenzyme A" and should be normalized:
first node:
id: PUBCHEM.COMPOUND:6816
name: "[[(2R,3S,4R,5R)-5-(6-aminopurin-9-yl)-4-hydroxy-3-phosphonooxyoxolan-2-yl]methoxy-hydroxyphosphoryl] [3-hydroxy-2,2-dimethyl-4-oxo-4-[[3-oxo-3-(2-sulfanylethylamino)propyl]amino]butyl] hydrogen phosphate"
equivalent_curies: ["PathWhiz.Compound:1099", "INCHIKEY:RGJOEKWQDUBAIZ-DRCCLKDXSA-N", "PUBCHEM.COMPOUND:6816", "CHEMBL.COMPOUND:CHEMBL1623949", "KEGG.COMPOUND:C00010"]
category: "biolink:SmallMolecule"
all_names: ["Coenzyme A"]

second node:
id: PUBCHEM.COMPOUND:87642
name: "coenzyme A"
equivalent_curies: ["UNII:SAA04E81UX", "RXNORM:1314344", "UMLS:C0009140", "DRUGBANK:DB01992", "CAS:143180-18-1", "INCHIKEY:RGJOEKWQDUBAIZ-IBOSZNHHSA-N", "CHEBI:15346", "HMDB:HMDB0001423", "CHEMBL.COMPOUND:CHEMBL1213327", "NCIT:C384", "MESH:D003065", "GTOPDB:3044", "PUBCHEM.COMPOUND:87642"]

for more details and evidence, see https://github.com/RTXteam/RTX-KG2/issues/282