MarkDWilliams commented 3 years ago

Hate to bother with more issues, but I'm only finding these because of how useful I've found ARAX/ARAX interface 😄 I tried running the following query for finding chemicals connected to SPMSY, which mostly works fine. However, the top result is Octreotide, which doesn't seem to have any relation that I could find to SPMSY, and it's linked by the predicate biolink:close_match which is valid biolink, but doesn't make sense in this context. Looking through the PMIDs linked as evidence, they seem to be about somatostatin. Could be a parsing issue in Semmed between SMS (spermine synthase) and somatostatin analogues. Low priority, but I thought I'd give y'all a heads up.

    {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "id": "UniProtKB:P52788"
                    },
                    "n1": {
                        "category": "biolink:ChemicalSubstance"
                    }
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1"
                    }
                }
            }
        }
    }
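
For anyone who wants to reproduce the query programmatically, here is a minimal Python sketch that builds the same one-hop query graph and serializes it. The helper name `build_query` is hypothetical, and the sketch only constructs the payload; it does not assume any particular endpoint or client.

```python
import json


def build_query(source_id: str, target_category: str) -> dict:
    """Build a one-hop query graph: pinned node n0 linked to any n1 of a category.

    Hypothetical helper for illustration; it mirrors the query in this issue.
    """
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"id": source_id},
                    "n1": {"category": target_category},
                },
                "edges": {
                    # No predicate is given, so any edge type may match.
                    "e01": {"subject": "n0", "object": "n1"},
                },
            }
        }
    }


query = build_query("UniProtKB:P52788", "biolink:ChemicalSubstance")
print(json.dumps(query, indent=4))
```

Because the edge `e01` carries no predicate constraint, results such as the `biolink:close_match` edge described above can be returned alongside more specific relationships.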


dkoslicki commented 3 years ago

Thanks @MarkDWilliams, keep the issues coming! I’ve tagged our KG2 people to take a look at this. Most likely it’s a mis-synonymized node or the like.

amykglen commented 3 years ago

it looks like the questionable edge may be between "NCBIGene:6611" and "UMLS:C0728977" in KG2.3.4:

match p=(n)-[:close_match]-(m) where n.id in    ["UMLS:C1420262", "HGNC:11123", "UMLS:C0795864", "MEDDRA:10081680", "UMLS:C0037875", "MESH:D013097", "UMLS:C3661485", "MESH:D058496", "UMLS:C1866927", "NCIT:C75469", "OMIM:300105", "OMIM:182290", "SNOMED:401315004", "SNOMED:64142003", "ORPHANET:819", "ORPHANET:138543", "MONDO:0008434", "DOID:0060768", "PR:P52788", "PR:000015301", "UniProtKB:P52788", "ENSEMBL:ENSG00000102172", "CHEMBL.TARGET:CHEMBL4934", "NCBIGene:6611", "PathWhiz.ProteinComplex:7228", "PathWhiz.ProteinComplex:4235", "PathWhiz.ProteinComplex:337", "PathWhiz.ProteinComplex:9617", "PathWhiz.ProteinComplex:9011", "PathWhiz.ProteinComplex:10029", "PathWhiz.ProteinComplex:8353"] and m.id in     ["UMLS:C0028833", "ATC:H01CB02", "DRUGBANK:DB00104", "LOINC:MTHU035158", "LOINC:LP97954-9", "UMLS:C0724649", "UMLS:C0728977", "MESH:D015282", "UMLS:C1170602", "UMLS:C1518540", "UMLS:C1518539", "UMLS:C1521974", "UMLS:C1709307", "NCIT:C53447", "NCIT:C711", "UMLS:C1328681", "UMLS:C1518541", "UMLS:C0338271", "NCIT:C2402", "NDDF:004012", "NDDF:002107", "PDQ:CDR0000038866", "UMLS:C1328682", "UMLS:C1328680", "PDQ:CDR0000042358", "RXNORM:221130", "RXNORM:7617", "SNOMED:109055007", "SNOMED:109053000", "VANDF:4019472", "VANDF:4019864", "CHEBI:7726", "CHEMBL.COMPOUND:CHEMBL262746", "CHEMBL.COMPOUND:CHEMBL1680", "CHEMBL.COMPOUND:CHEMBL1200480", "CHEMBL.COMPOUND:CHEMBL2105834", "CHEMBL.COMPOUND:CHEMBL3182554", "CHEMBL.COMPOUND:CHEMBL3350037", "CHEMBL.COMPOUND:CHEMBL3545066", "HMDB:HMDB0014262", "KEGG:C07306"] return p limit 10
[screenshot, 2021-02-10 12:41:40 PM]
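
The Cypher query above looks for `close_match` edges, in either direction, between any synonym of the gene and any synonym of the drug. The same check can be sketched in Python over an in-memory edge list. This is only an illustration: the function name `edges_between`, the dictionary layout of an edge, and the second toy edge are assumptions, not the KG2 schema or real KG2 data.

```python
def edges_between(edges, left_ids, right_ids, predicate):
    """Return edges with the given predicate linking the two ID sets, in either direction."""
    left, right = set(left_ids), set(right_ids)
    hits = []
    for edge in edges:
        if edge["predicate"] != predicate:
            continue
        s, o = edge["subject"], edge["object"]
        # Undirected match, like (n)-[:close_match]-(m) in Cypher.
        if (s in left and o in right) or (s in right and o in left):
            hits.append(edge)
    return hits


# Toy edge list. The first edge mirrors the suspect edge named in this issue;
# the second is made up purely so the filter has something to reject.
toy_edges = [
    {"subject": "UMLS:C0728977", "predicate": "biolink:close_match", "object": "NCBIGene:6611"},
    {"subject": "NCBIGene:6611", "predicate": "biolink:related_to", "object": "CHEBI:7726"},
]

gene_ids = ["NCBIGene:6611", "UniProtKB:P52788", "HGNC:11123"]
drug_ids = ["UMLS:C0728977", "CHEBI:7726", "DRUGBANK:DB00104"]

suspect = edges_between(toy_edges, gene_ids, drug_ids, "biolink:close_match")
for edge in suspect:
    print(edge["subject"], edge["predicate"], edge["object"])
```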

although there also seems to be a synonymization issue for SPMSY (UniProtKB:P52788):

[screenshot, 2021-02-10 12:43:45 PM]

I'll write this mis-synonymization up in a separate issue! (#1250)

edeutsch commented 3 years ago

@MarkDWilliams thanks for reporting, please don't ever hesitate to "bother".

However, you should be aware that we have a systemic problem with certain concepts being merged with others that should not be. There is a plan for fixing this, but an exact timeline is not currently known. It affects a small percentage of nodes, but with so many nodes, it is a substantial number overall. Your exact issue will not be fixed super soon since a broader fix is underway, but it will be fixed in the not too distant future. We will use this example to verify the fix and will let you know when it is working. Feel free to report more, or skip the reporting of more until the broader fix is in place and many such issues are resolved, as you wish.

saramsey commented 3 years ago


Here is the problem. Based on my recent fix to issue #1204, I think the next build of KG2 would once again have this edge as biolink:close_match, which is probably not what we want. I think SemMedDB xref edges should be mapped to biolink:related_to instead of biolink:close_match. So I will fix that in semmeddb_tuple_list_json_to_kg_json.py.
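
As a rough illustration of the kind of remapping described here, the sketch below overrides the predicate for one source relation and leaves all others unchanged. It is not the actual code in semmeddb_tuple_list_json_to_kg_json.py: the names `PREDICATE_OVERRIDES` and `map_relation` are hypothetical, and `SEMMEDDB:treats` is used only as an example of a relation that should pass through untouched.

```python
# Hypothetical override table: source relation CURIE -> Biolink predicate.
PREDICATE_OVERRIDES = {
    "SEMMEDDB:xref": "biolink:related_to",
}


def map_relation(relation_curie: str, default_predicate: str) -> str:
    """Return the overriding predicate for a relation, or the default if none is set."""
    return PREDICATE_OVERRIDES.get(relation_curie, default_predicate)


# An xref edge that would otherwise be labeled close_match is downgraded.
print(map_relation("SEMMEDDB:xref", "biolink:close_match"))
# Any relation without an override keeps its default predicate.
print(map_relation("SEMMEDDB:treats", "biolink:treats"))
```

Keeping the override in a small lookup table, rather than in branching logic, makes it easy to see at a glance which source relations are deliberately mapped to a weaker predicate.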

saramsey commented 3 years ago

Just to amplify what @dkoslicki and @edeutsch already said above: we appreciate bug reports, especially bugs that pertain to ARAX as seen via the ARS / TRAPI.

ecwood commented 3 years ago

This appears to be fixed in KG2.5.2, where the cypher query:

match (n {id: 'NCBIGene:6611'})<-[r]-(m {id: 'UMLS:C0728977'}) return r.predicate, r.predicate_label, r.relation, r.relation_label;

returns:

r.predicate | r.predicate_label | r.relation | r.relation_label
-- | -- | -- | --
"biolink:related_to" | "related_to" | "SEMMEDDB:xref" | "xref"