ExposuresProvider / icees-kg

Integrated Clinical and Environmental Exposures Service (ICEES) Knowledge Graph

ICEES KG meta-KG and Biolink mappings #12

Open karafecho opened 1 year ago

karafecho commented 1 year ago

This issue is to formally report a disconnect between the Biolink mappings that are included in the ICEES API all_features config files and those that support ICEES KG, as reported in the meta-KG. The approach that we've implemented to automate some of the work and leverage SRI services is not picking up certain intended Biolink mappings. For instance, AvgDailyPM2.5Exposure should map to biolink:ChemicalEntity and biolink:EnvironmentalExposure. To provide another example, TotalEDVisits should map to biolink:ClinicalIntervention.

maximusunc commented 1 year ago

Hi Kara. What happens here is that we send the specified search term to Name Resolver, which returns matching CURIEs. Once we have CURIEs, we get the corresponding Biolink categories from Node Normalizer. If we want specific Biolink categories, we will need to either update the search term so that it returns a CURIE with the wanted categories, or hard-code the CURIE and/or Biolink categories in the all_features YAML file, or some mix of both. What are your thoughts?
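
A minimal sketch of the flow described above, with an optional hard-coded override. The `resolve_name` and `get_categories` callables are hypothetical stand-ins for the Name Resolver and Node Normalizer calls, and the `hard_coded` dict is an assumed shape for an all_features entry. None of this is taken from the actual precompute script.

```python
from typing import Callable, Dict, List, Optional


def map_feature(
    search_term: str,
    resolve_name: Callable[[str], List[str]],
    get_categories: Callable[[str], List[str]],
    hard_coded: Optional[Dict] = None,
) -> Dict:
    """Return {'curie': ..., 'categories': [...]} for one feature.

    Hard-coded values (hypothetical keys 'curie' and 'categories')
    take precedence over whatever the services return.
    """
    hard_coded = hard_coded or {}

    # Step 1: search term -> CURIE (first match), unless hard-coded.
    curie = hard_coded.get("curie")
    if curie is None:
        matches = resolve_name(search_term)
        curie = matches[0] if matches else None

    # Step 2: CURIE -> Biolink categories, unless hard-coded.
    categories = hard_coded.get("categories")
    if categories is None:
        categories = get_categories(curie) if curie else []

    return {"curie": curie, "categories": list(categories)}
```

With this shape, changing the search term affects step 1 only, while hard-coding categories bypasses step 2 for that feature.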

karafecho commented 1 year ago

Yeah, I understand the process, and I knew that some of the Biolink categories were being dropped when we started leveraging SRI services, but I wasn't really concerned until recently, when a use case arose. Specifically, ICEES KG is returning environmental exposures such as "benzene" in response to the first hop of Path A in the TCDC's workflow (see slide 10 here). This is introducing noise into the final answer set. As such, we would like to filter chemical exposures from the first hop using an exclude edge, but we cannot do that (I don't think) without attaching a Biolink category such as biolink:EnvironmentalExposure to non-drug biolink:ChemicalEntity nodes. I had played around with the search terms to see if SRI supported environmental exposures, but I don't think those are represented. In some sense, this is a data modeling issue, but I'd like to identify a quick fix that will resolve the current issue. I am completely open to suggestions.
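
To illustrate why the extra category is needed, here is a rough sketch that filters nodes by category on the client side. It is not how an exclude edge is expressed in a Translator query; `drop_exposures` and the abbreviated category lists are assumptions made for the example.

```python
EXPOSURE = "biolink:EnvironmentalExposure"


def drop_exposures(nodes):
    """Keep only nodes whose categories do not include EXPOSURE."""
    return {curie: cats for curie, cats in nodes.items() if EXPOSURE not in cats}


# Abbreviated category lists; benzene carries a hand-added exposure tag.
nodes = {
    "PUBCHEM.COMPOUND:2083": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],
    "PUBCHEM.COMPOUND:241": ["biolink:SmallMolecule", "biolink:ChemicalEntity", EXPOSURE],
}

kept = drop_exposures(nodes)
```

Without the added tag, benzene and albuterol carry the same categories, so a category-based filter cannot tell them apart.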

maximusunc commented 1 year ago

It's up to you. From my end, I would just need to rerun the precompute script after you update the all_features file.

karafecho commented 1 year ago

Let's move forward with hard coding, as I think this will allow us to move in a more timely manner with the TCDC workflow and related Translator efforts. That said, let's hold off on running the precompute script until after it is updated to include new calculations (see #13, #14, #15, #16).

karafecho commented 1 year ago

To clarify, the all_features YAML files already contain most of the intended Biolink mappings, although I would like to make a few adjustments for consistency. Shouldn't take long.

karafecho commented 1 year ago

Update 11.14.2022:

This Node Norm endpoint returns the following output for three test inputs:

PUBCHEM.COMPOUND:2083 (albuterol)

    "type": [
      "biolink:SmallMolecule",
      "biolink:MolecularEntity",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ],

MESH:D052638 (particulate matter)

    "type": [
      "biolink:ComplexMolecularMixture",
      "biolink:ChemicalMixture",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ]

PUBCHEM.COMPOUND:241 (benzene)

    "type": [
      "biolink:SmallMolecule",
      "biolink:MolecularEntity",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ],

If I change the search terms by adding "exposure" for the last two variables above, here's what Node Norm outputs:

UMLS:C2136615 (airborne pollutant exposure)

    "type": [
      "biolink:PhenotypicFeature",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:ThingWithTaxon",
      "biolink:BiologicalEntity",
      "biolink:NamedThing",
      "biolink:Entity"
    ]

NCIT:C36251 (benzene exposure)

    "type": [
      "biolink:PhenotypicFeature",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:ThingWithTaxon",
      "biolink:BiologicalEntity",
      "biolink:NamedThing",
      "biolink:Entity"
    ],

So, Node Norm is now recognizing things like chemical exposures, BUT the mappings to biolink:ChemicalEntity are lost, AND the mappings to biolink:PhenotypicFeature seem a bit weird to me (especially when biolink:EnvironmentalExposure is an option) but are okay-ish.

Decision: (1) Add biolink:EnvironmentalExposure mappings to exposures that Node Norm returns. (2) Ask Biolink team about the mappings for chemical exposures (second set of examples above) and other types of exposures. (3) Address any downstream normalization issues with ICEES output when/if they arise.
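
Decision (1) could be sketched roughly as follows. The helper name `add_exposure_category` is hypothetical, and appending the extra category at the end of the list is an assumption; the thread does not say where it should sit relative to the Node Norm categories.

```python
def add_exposure_category(node_norm_types, extra="biolink:EnvironmentalExposure"):
    """Return the Node Norm categories plus `extra`, without duplicating it."""
    types = list(node_norm_types)  # copy so the caller's list is untouched
    if extra not in types:
        types.append(extra)
    return types


# First few categories Node Norm returned for PUBCHEM.COMPOUND:241 (benzene).
benzene_types = [
    "biolink:SmallMolecule",
    "biolink:MolecularEntity",
    "biolink:ChemicalEntity",
]
supplemented = add_exposure_category(benzene_types)
```

This keeps the biolink:ChemicalEntity mapping that is lost when the search term is changed to "benzene exposure".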

karafecho commented 1 year ago

Noting that the YAML files contain a number of Biolink mappings that are not supported by Node Norm. For instance:

{
  "UMLS:C0019993": {
    "id": {
      "identifier": "UMLS:C0019993",
      "label": "Hospitalization"
    },
    "equivalent_identifiers": [
      {
        "identifier": "UMLS:C0019993",
        "label": "Hospitalization"
      }
    ],
    "type": [
      "biolink:Activity",
      "biolink:ActivityAndBehavior",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:Occurrent",
      "biolink:PhysicalEssenceOrOccurrent"
    ]
  },
  "": null
}

I mapped "hospitalization" to biolink:ClinicalIntervention, which seems more appropriate than the Node Norm mappings that are returned.

karafecho commented 1 year ago

Updated decision / action items [assigned to Kara]:

karafecho commented 1 year ago

Notes on supplemental Biolink mappings.

  1. Airborne pollutants were mapped to biolink:ChemicalEntity and biolink:EnvironmentalExposure. The first mapping is redundant with what Node Norm returns, but I think that's okay: it records how I mapped these variables before ICEES was split into ICEES+ and ICEES KG and before ICEES KG began relying on SRI services, rather than human curation, for its Biolink mappings.
  2. All landfill, CAFO, and roadway variables (except for roadway type) were mapped to biolink:ComplexChemicalMixture and biolink:EnvironmentalExposure.
  3. All socio-economic exposures (ACS variables) were mapped to biolink:EnvironmentalExposure.
  4. All variables related to clinical interventions (e.g., hospitalization, hospital LOS, ventilation, convalescent plasma, supplemental oxygen) were mapped to, well, biolink:ClinicalIntervention.
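
The four rules above can be summarized as a lookup table. This is only an illustrative sketch: the group keys and the `categories_for` helper are hypothetical and do not reflect the actual layout of the all_features YAML files. The category names are copied as written in the notes.

```python
# Hypothetical group names mapped to the supplemental categories listed above.
SUPPLEMENTAL_CATEGORIES = {
    "airborne_pollutant": [
        "biolink:ChemicalEntity",
        "biolink:EnvironmentalExposure",
    ],
    "landfill_cafo_roadway": [
        "biolink:ComplexChemicalMixture",
        "biolink:EnvironmentalExposure",
    ],
    "socioeconomic_acs": ["biolink:EnvironmentalExposure"],
    "clinical_intervention": ["biolink:ClinicalIntervention"],
}


def categories_for(group):
    """Return a copy of the supplemental categories for a variable group.

    Unknown groups (e.g. roadway type, which the notes exclude) get an
    empty list, meaning no supplemental mapping is applied.
    """
    return list(SUPPLEMENTAL_CATEGORIES.get(group, []))
```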