ExposuresProvider / cam-pipeline

Data loading pipeline for CAM database
https://exposuresprovider.github.io/cam-pipeline/
MIT License

CAM KP does not respond to any of ICEES KG-derived input CURIES #101

Open karafecho opened 1 year ago

karafecho commented 1 year ago

This issue is to report that CAM KP does not respond to any of the ICEES KG-derived CURIEs in this sheet, which are also appended below. Is this expected behavior? Is this a normalization issue? Is this something else?

ICEES input CURIEs for MVP2 queries ExposuresProvider/cam-kp-api#81 and ExposuresProvider/cam-kp-api#82
ICEES query: What chemicals are associated with primary ciliary dyskinesia?
[Jupyter notebook](https://colab.research.google.com/drive/1CdO0XtUddVzt5bRzlcTs8tW0SFc2ggkP#scrollTo=ZVvv2zGGsr-4&uniqifier=1)

ENTITIES_OF_INTEREST = [
    "PUBCHEM.COMPOUND:5865",            # Prednisone
    "CHEMBL.COMPOUND:CHEMBL1256818",    # Dextromethorphan hydrobromide
    "PUBCHEM.COMPOUND:165363555",       # Trifacta
    "HMDB:HMDB0252416",                 # Fluticasone
    "PUBCHEM.COMPOUND:123600",          # Levalbuterol
    "HMDB:HMDB0242500",                 # Budesonide
    "CHEBI:5147",                       # Formoterol
    "CHEMBL.COMPOUND:CHEMBL158",        # Aztreonam
    "PUBCHEM.COMPOUND:145068",          # Nitric oxide
    "PUBCHEM.COMPOUND:281",             # Carbon monoxide
]
gaurav commented 8 months ago

Thanks for these identifiers! I've added them to the brand new Automat-CAM-KP test suite (https://github.com/ExposuresProvider/cam-pipeline/pull/111), and here are the results I have:

| CURIE | Normalized to | How many unique CURIEs is this connected to in Automat-CAM-KP? |
| --- | --- | --- |
| PUBCHEM.COMPOUND:5865 | Normalized | 30 |
| CHEMBL.COMPOUND:CHEMBL1256818 | PUBCHEM.COMPOUND:5462351 | None |
| PUBCHEM.COMPOUND:165363555 | Normalized | None |
| HMDB:HMDB0252416 | PUBCHEM.COMPOUND:2462 | None |
| PUBCHEM.COMPOUND:123600 | Normalized | None |
| HMDB:HMDB0242500 | PUBCHEM.COMPOUND:2462 | None |
| CHEBI:5147 | PUBCHEM.COMPOUND:3410 | None |
| CHEMBL.COMPOUND:CHEMBL158 | PUBCHEM.COMPOUND:5742832 | 9 |
| PUBCHEM.COMPOUND:145068 | Normalized | 258 |
| PUBCHEM.COMPOUND:281 | Normalized | 64 |
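
For reference, a count like the one in the last column could be derived from a TRAPI response roughly as follows. This is a minimal sketch, not the code used in the test suite: the function name `connected_curies` is hypothetical, the CURIEs in the example are made up, and it assumes the standard TRAPI layout with edges stored under `message.knowledge_graph.edges`.

```python
def connected_curies(trapi_response: dict, curie: str) -> set:
    """Return the unique CURIEs that share at least one edge with `curie`.

    Assumes the standard TRAPI layout, where each knowledge graph edge
    carries a "subject" and an "object" CURIE.
    """
    message = trapi_response.get("message") or {}
    kg = message.get("knowledge_graph") or {}
    edges = kg.get("edges") or {}

    neighbours = set()
    for edge in edges.values():
        subject, obj = edge.get("subject"), edge.get("object")
        if subject == curie and obj is not None:
            neighbours.add(obj)
        elif obj == curie and subject is not None:
            neighbours.add(subject)
    # A self-loop should not count the CURIE as its own neighbour.
    neighbours.discard(curie)
    return neighbours


# Hand-made response with placeholder CURIEs (not real CAM-KP output).
example = {
    "message": {
        "knowledge_graph": {
            "edges": {
                "e1": {"subject": "EXAMPLE:A", "predicate": "biolink:affects", "object": "EXAMPLE:B"},
                "e2": {"subject": "EXAMPLE:A", "predicate": "biolink:affects", "object": "EXAMPLE:B"},
                "e3": {"subject": "EXAMPLE:C", "predicate": "biolink:affects", "object": "EXAMPLE:A"},
            }
        }
    }
}

print(len(connected_curies(example, "EXAMPLE:A")))  # 2
```

Duplicate edges to the same neighbour are counted once, which matches the "unique CURIEs" wording of the column header.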
gaurav commented 8 months ago

@balhoff Do you have thoughts on how to plug the gaps we see here in node coverage? I'm guessing we need new data sources.

karafecho commented 8 months ago

Thanks, @gaurav! While we don't have a 1:1 match between CURIEs, the matches that we do have are representative, with two drugs and two chemical exposures, and will allow us to move this effort along.

karafecho commented 8 months ago

This Swagger example query runs successfully, but it returns 0 results. If I replace the input CURIEs with PUBCHEM.COMPOUND:5865 from the table above, the query also runs successfully, but it still returns 0 results. I think the Automat example queries are standardized and not tailored to the underlying KGs, so perhaps you can send me an example query that returns results from CAM KP? Thanks!

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": [
            "biolink:ChemicalEntity"
          ],
          "ids": [
            "CHEMBL.COMPOUND:CHEMBL3234626",
            "CHEMBL.COMPOUND:CHEMBL3234633"
          ]
        },
        "n1": {
          "categories": [
            "biolink:GeneOrGeneProduct"
          ],
          "ids": [
            "NCBIGene:2099"
          ]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:affects"
          ],
          "qualifier_constraints": [
            {
              "qualifier_set": [
                {
                  "qualifier_type_id": "biolink:object_aspect_qualifier",
                  "qualifier_value": "activity"
                },
                {
                  "qualifier_type_id": "biolink:object_direction_qualifier",
                  "qualifier_value": "increased"
                },
                {
                  "qualifier_type_id": "biolink:qualified_predicate",
                  "qualifier_value": "biolink:causes"
                }
              ]
            }
          ]
        }
      }
    }
  },
  "workflow": [
    {
      "id": "lookup"
    }
  ]
}
gaurav commented 8 months ago

Hi Kara! Sorry about the confusion: that Swagger example query can't currently be configured for individual platers, so we share a single Swagger with all the platers on Automat. That one isn't relevant to us, and has two main problems:

  1. CAM-KP doesn't currently know about CHEMBL.COMPOUND:CHEMBL3234626 or CHEMBL.COMPOUND:CHEMBL3234633. We'd have to ingest new pathways to provide information on them. We do have information on NCBIGene:2099.
  2. CAM-KP can't handle qualifiers until https://github.com/ExposuresProvider/cam-pipeline/pull/104 has been incorporated, which we're hoping to do really soon!

So the following query will work:

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": [
            "biolink:ChemicalEntity"
          ]
        },
        "n1": {
          "categories": [
            "biolink:GeneOrGeneProduct"
          ],
          "ids": [
            "NCBIGene:2099"
          ]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:affects"
          ]
        }
      }
    }
  },
  "workflow": [
    {
      "id": "lookup"
    }
  ]
}
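
A query of this shape can also be generated programmatically. The following is a minimal sketch under the assumption that only the fields shown in the JSON above are needed; the helper name `one_hop_query` is hypothetical and not part of any Automat or TRAPI client library.

```python
import json


def one_hop_query(subject_category, object_category, predicate,
                  subject_ids=None, object_ids=None):
    """Build a one-hop TRAPI query without qualifier constraints."""

    def node(category, ids):
        # Only pin a node to specific CURIEs when some are given.
        n = {"categories": [category]}
        if ids:
            n["ids"] = list(ids)
        return n

    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": node(subject_category, subject_ids),
                    "n1": node(object_category, object_ids),
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [predicate],
                    }
                },
            }
        },
        "workflow": [{"id": "lookup"}],
    }


# Same query as above: any chemical that affects NCBIGene:2099.
query = one_hop_query(
    "biolink:ChemicalEntity",
    "biolink:GeneOrGeneProduct",
    "biolink:affects",
    object_ids=["NCBIGene:2099"],
)
print(json.dumps(query, indent=2))
```

Leaving `n0` without `ids` and omitting `qualifier_constraints` sidesteps both problems listed above.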
karafecho commented 8 months ago

No confusion, I was aware that the Swagger examples aren't really "examples" for most of the Automats, including cam-kp and icees-kg. Thanks for an actual example query!

karafecho commented 8 months ago

This query returns results when sent directly to automat-icees-kg at https://automat.renci.org/#/.

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": [
            "biolink:DiseaseOrPhenotypicFeature"
          ],
          "ids": [
            "MONDO:0009061"
          ]
        },
        "n1": {
          "categories": [
            "biolink:ChemicalEntity"
          ]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:correlated_with"
          ]
        }
      }
    }
  },
  "workflow": [
    {
      "id": "lookup"
    }
  ]
}

And this query returns responses when sent directly to automat-cam-kp at https://automat.renci.org/#/.

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": [
            "biolink:ChemicalEntity"
          ],
          "ids": [
            "PUBCHEM.COMPOUND:5865"
          ]
        },
        "n1": {
          "categories": [
            "biolink:GeneOrGeneProduct"
          ]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": [
            "biolink:affects"
          ]
        }
      }
    }
  },
  "workflow": [
    {
      "id": "lookup"
    }
  ]
}

But this query, while able to run successfully, returns an empty response when sent to WFR at https://translator-workflow-runner.renci.org/docs#/trapi/run_workflow_query_post.

{
    "workflow": [
        {
            "id": "lookup"
        },
        {
            "id":"score"
        }
    ],
    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "predicates": [
                        "biolink:correlated_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "provided_by": {
                        "allowlist": [
                            "infores:automat-icees-kg"
                        ]
                    }
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2",
                    "predicates": [
                        "biolink:affects"
                    ],
                    "provided_by": {
                        "allowlist": [
                            "infores:automat-cam-kp"
                        ]
                    }
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0009061"
                    ],
                    "is_set": false
                },
                "n1": {
                    "categories": [
                        "biolink:ChemicalEntity"
                    ],
                    "is_set": false
                },
                "n2": {
                    "categories": [
                        "biolink:GeneOrGeneProduct"
                    ],
                    "is_set": false
                }
            }
        }
    }
}
maximusunc commented 8 months ago

This comes from going through ARAs that have strict KP timeouts versus sending queries directly to KPs. I also wasn't able to get any results from the WFR, but sending directly to Aragorn with an extended timeout returns a 16.6 MB response with 12k results in total. ICEES-KG took 35 seconds to respond to the first hop (the normal timeout is 10 seconds) and returned 106 results, and then CAM-KP took 90 seconds to respond with the 12k results. If you want, I can share the entire response.

karafecho commented 8 months ago

Thanks, Max.

Given your findings, the revised query below should run when sent to WFR and return results. However, while it runs successfully, it returns an empty KG.

{
    "workflow": [
        {
            "id": "lookup",
            "runner_parameters": {
                "allowlist": ["infores:aragorn"]
            }
        },
        {
            "id":"score"
        }
    ],
    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "predicates": [
                        "biolink:correlated_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "provided_by": {
                        "allowlist": [
                            "infores:automat-icees-kg"
                        ]
                    }
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2",
                    "predicates": [
                        "biolink:affects"
                    ],
                    "provided_by": {
                        "allowlist": [
                            "infores:automat-cam-kp"
                        ]
                    }
                }
            },
            "nodes": {
                "n0": {
                    "ids": [
                        "MONDO:0009061"
                    ],
                    "is_set": false
                },
                "n1": {
                    "categories": [
                        "biolink:ChemicalEntity"
                    ],
                    "is_set": false
                },
                "n2": {
                    "categories": [
                        "biolink:GeneOrGeneProduct"
                    ],
                    "is_set": false
                }
            }
        }
    }
}
maximusunc commented 8 months ago

Your query doesn't have the extended timeout that I'm able to set directly in Aragorn. So WFR is returning nothing because icees-kg times out on the first hop. This is a performance issue, and I'm only able to get results back because I can peek behind the curtain and turn some hidden knobs.
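
The arithmetic behind this can be sketched as follows. It is a simplified model, assuming the timeout is applied to each KP call independently and that a timed-out first hop leaves nothing to feed the second. The hop timings and the 10-second default are the ones reported earlier in this thread; the 120-second value is a hypothetical extended timeout chosen only for illustration.

```python
def first_timed_out_hop(hop_seconds, timeout_seconds):
    """Return the name of the first hop whose KP exceeds the timeout, else None."""
    for name, seconds in hop_seconds:
        if seconds > timeout_seconds:
            return name
    return None


# Response times reported earlier in this thread, in hop order.
hops = [("automat-icees-kg", 35), ("automat-cam-kp", 90)]

print(first_timed_out_hop(hops, timeout_seconds=10))   # automat-icees-kg
print(first_timed_out_hop(hops, timeout_seconds=120))  # None (hypothetical extended timeout)
```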

karafecho commented 8 months ago

Oh, I see. That makes sense.

In that case, perhaps you can send me the full response?

karafecho commented 8 months ago

Just so everyone is clear, the goal of this effort is three-fold:

  1. Team science - more tightly couple icees-kg and cam-kp under Exposures Provider.
  2. Scientific impact - leverage cam-kp to provide mechanistic insights into clinical observations derived from icees-kg (AOPs of scientific interest).
  3. Translator MVP2 queries - contribute to Translator MVP2 queries by leveraging the CQS and targeting the clinical KPs in the first hop and cam-kp in the second hop.
karafecho commented 7 months ago

Also see [this GitHub folder](https://github.com/NCATSTranslator/Clinical-Data-Committee-Tracking-Voting/tree/main/GetCreative()_DrugDiscoveryRepurposing_RarePulmonaryDisease/MVP2_Path_A) and slide 9 in this slide deck.

karafecho commented 6 months ago

Per decision on 01.03.2024: Max will rerun the above queries with extended timeouts in ARAGORN and cache the results. Kara will then test.

karafecho commented 5 months ago

From Meisha, 01/17/2024:

Title: Peptide Oxidation Leading to Hypertension

Description from the wiki:

Here we present the supporting information on an AOP describing how vascular endothelial peptide oxidation leads to hypertension via perturbation of endothelial nitric oxide (NO) bioavailability. The molecular initiating event is oxidation of amino acid (AA) residues on critical peptides of the NO pathway, notably protein kinase B (AKT), guanosine triphosphate cyclohydrolase-1 (GTPCH-1), endothelial nitric oxide synthase (eNOS), and also the cellular ROS scavenger, glutathione. Oxidation of the enzymic components of the pathway leads to reduced expression of the phosphorylated proteins, and protein loss via proteasomal degradation. Oxidation of reduced glutathione to GSSG promotes bonding of GSSG to critical AA residues on eNOS, and the reduced expression of GTPCH-1 reduces bioavailability of tetrahydrobiopterin (BH4), both of which lead to uncoupling of eNOS (reduced NO production, increased superoxide production). The combination of these molecular events leads to reduced bioavailability of NO, which in turn reduces the potential for vasodilation and shifts the balance of vascular tone towards vasoconstriction. Repeated perturbation of this pathway via chronic exposure to toxicants ultimately increases vascular resistance and contributes towards the development of hypertension.

karafecho commented 5 months ago

From Max, 02/05/2024: cam_kp_integration_response.json - CF - ChemicalEntity - GeneOrGeneProduct

ChemicalEntity = propranolol

https://pubmed.ncbi.nlm.nih.gov/23539159/

https://www.uspharmacist.com/article/advances-in-the-management-of-cystic-fibrosis

https://www.journal-of-hepatology.eu/article/S0168-8278(15)00349-9/fulltext

gaurav commented 4 months ago

I took another stab at the CURIEs I couldn't figure out previously, and found three more of them in CAM-KP. Most of these are NodeNorm issues in one way or another, but at least one of them could be fixed by turning on drug conflation when processing CAM-KP. I propose we use the alternate CURIEs I listed below while I try to figure out the NodeNorm issues.

| CURIE | Normalized to | Should actually be normalized to | How many unique CURIEs is this connected to in Automat-CAM-KP? |
| --- | --- | --- | --- |
| CHEMBL.COMPOUND:CHEMBL1256818 | PUBCHEM.COMPOUND:5462351 ("Dextromethorphan hydrobromide monohydrate") | PUBCHEM.COMPOUND:5360696 ("Dextromethorphan") | None, but should exist (see Dextromethorphan on CTD) |
| PUBCHEM.COMPOUND:165363555 ("Trifacta") | Normalized | N/A | None |
| HMDB:HMDB0252416 ("Fluticasone") | PUBCHEM.COMPOUND:4659387 ("Fluticasona [Spanish]") | PUBCHEM.COMPOUND:5311101 ("Fluticasone") | 88 |
| PUBCHEM.COMPOUND:123600 ("Levalbuterol") | Normalized | N/A | None |
| HMDB:HMDB0242500 ("Budesonide") | PUBCHEM.COMPOUND:5281004 | N/A | 167 |
| CHEBI:5147 ("Formoterol") | PUBCHEM.COMPOUND:3410 ("Formoterol") | PUBCHEM.COMPOUND:45358055 ("Foradil Certihaler"), but cliques to 3410 with drug_conflation turned on | 53 |
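
To make the drug conflation point concrete, here is a minimal sketch of how a normalization lookup with conflation switched on might be prepared and read. The request field names (`curies`, `drug_chemical_conflate`) and the response shape (`id.identifier`, `id.label`, `null` for unknown CURIEs) are assumptions about NodeNorm's `get_normalized_nodes` endpoint and should be checked against the service's own API documentation. The helper names are hypothetical, and no network call is made.

```python
def normalization_payload(curies, drug_conflation=False):
    """Request body for a NodeNorm-style lookup.

    The field names are assumptions; verify them against the service's
    API documentation before use.
    """
    return {
        "curies": list(curies),
        "drug_chemical_conflate": drug_conflation,
    }


def preferred_ids(response):
    """Map each input CURIE to its preferred identifier, or None if unknown."""
    result = {}
    for curie, entry in response.items():
        result[curie] = entry["id"]["identifier"] if entry else None
    return result


payload = normalization_payload(["CHEBI:5147", "EXAMPLE:unknown"], drug_conflation=True)

# Hand-written response in the assumed shape. The CHEBI:5147 mapping is
# taken from the table above; EXAMPLE:unknown is a made-up placeholder.
sample_response = {
    "CHEBI:5147": {"id": {"identifier": "PUBCHEM.COMPOUND:3410", "label": "Formoterol"}},
    "EXAMPLE:unknown": None,
}

print(preferred_ids(sample_response))
```

Comparing the output of `preferred_ids` with conflation off and on for the same CURIEs would show which rows in the table change clique.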