Open karafecho opened 1 year ago
Hi Kara. What happens here is that we send the specified search term to Name Resolver, which gives back the CURIEs that match. Once we have CURIEs, we get the corresponding Biolink categories from Node Normalizer. If we want specific Biolink categories, we will need to either update the search term so that it returns a CURIE with the wanted categories, or hard-code the CURIE and/or Biolink categories in the all_features YAML file, or even a mix of both. What are your thoughts?
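For reference, the flow described above can be sketched roughly as follows. This is only an illustration: `resolve_categories`, the `hard_coded` argument, and the two stub functions are hypothetical names, the stubs stand in for the real Name Resolver and Node Normalizer calls, and the category lists are abbreviated.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type aliases for this sketch.
Lookup = Callable[[str], List[str]]      # search term -> matching CURIEs
Normalize = Callable[[str], List[str]]   # CURIE -> Biolink categories


def resolve_categories(
    search_term: str,
    lookup: Lookup,
    normalize: Normalize,
    hard_coded: Optional[Dict[str, Dict[str, List[str]]]] = None,
) -> Dict[str, List[str]]:
    """Return {CURIE: [Biolink categories]} for one search term."""
    hard_coded = hard_coded or {}
    if search_term in hard_coded:
        # Hard-coded entries win over whatever the services would return.
        return hard_coded[search_term]
    return {curie: normalize(curie) for curie in lookup(search_term)}


# Stubs standing in for the real services (values abbreviated, illustrative).
def fake_lookup(term: str) -> List[str]:
    return {"benzene": ["PUBCHEM.COMPOUND:241"]}.get(term, [])


def fake_normalize(curie: str) -> List[str]:
    return ["biolink:SmallMolecule", "biolink:ChemicalEntity"]


automatic = resolve_categories("benzene", fake_lookup, fake_normalize)
curated = resolve_categories(
    "benzene",
    fake_lookup,
    fake_normalize,
    hard_coded={
        "benzene": {
            "PUBCHEM.COMPOUND:241": [
                "biolink:ChemicalEntity",
                "biolink:EnvironmentalExposure",
            ]
        }
    },
)
```

Passing the services in as callables keeps the sketch runnable offline; in practice the hard-coded entries would come from the all_features YAML file.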
Yeah, I understand the process, and I knew that some of the Biolink categories were being dropped when we started leveraging SRI services, but I wasn't really concerned until recently, when a use case arose. Specifically, ICEES KG is returning environmental exposures such as "benzene" in response to the first hop of Path A in the TCDC's workflow (see slide 10 here). This is introducing noise into the final answer set. As such, we would like to filter chemical exposures from the first hop using an exclude edge, but we cannot do that (I don't think) without attaching a Biolink category such as biolink:EnvironmentalExposure to non-drug biolink:ChemicalEntity nodes. I had played around with the search terms to see if SRI supported environmental exposures, but I don't think those are represented. In some sense, this is a data modeling issue, but I'd like to identify a quick fix that will resolve the current issue. I am completely open to suggestions.
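To illustrate why the extra category matters, here is a small sketch. It is a plain post-hoc filter over node categories, not the exclude edge itself; `drop_category` is a hypothetical helper and the category lists are abbreviated.

```python
from typing import Dict, List


def drop_category(nodes: Dict[str, List[str]], category: str) -> Dict[str, List[str]]:
    """Keep only the nodes that do NOT carry the given Biolink category."""
    return {curie: cats for curie, cats in nodes.items() if category not in cats}


# With Node Norm categories only, a drug and an exposure look the same.
node_norm_only = {
    "PUBCHEM.COMPOUND:2083": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],  # albuterol
    "PUBCHEM.COMPOUND:241": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],   # benzene
}

# After attaching the exposure category to the non-drug chemical.
with_exposure = dict(node_norm_only)
with_exposure["PUBCHEM.COMPOUND:241"] = node_norm_only["PUBCHEM.COMPOUND:241"] + [
    "biolink:EnvironmentalExposure"
]

unfiltered = drop_category(node_norm_only, "biolink:EnvironmentalExposure")
filtered = drop_category(with_exposure, "biolink:EnvironmentalExposure")
```

In the first case nothing can be excluded, because benzene and albuterol share the same categories; in the second, only albuterol remains.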
It's up to you. From my end, I would just need to rerun the precompute script after you update the all_features file.
Let's move forward with hard coding, as I think this will allow us to move in a more timely manner with the TCDC workflow and related Translator efforts. That said, let's hold off on running the precompute script until after it is updated to include new calculations (see #13, #14, #15, #16).
To clarify, the all_features YAML files already contain most of the intended Biolink mappings, although I would like to make a few adjustments for consistency. Shouldn't take long.
Update 11.14.2022:
This Node Norm endpoint returns the following output for three test inputs:
PUBCHEM.COMPOUND:2083 (albuterol)
"type": [
"biolink:SmallMolecule",
"biolink:MolecularEntity",
"biolink:ChemicalEntity",
"biolink:PhysicalEssence",
"biolink:ChemicalOrDrugOrTreatment",
"biolink:ChemicalEntityOrGeneOrGeneProduct",
"biolink:ChemicalEntityOrProteinOrPolypeptide",
"biolink:NamedThing",
"biolink:Entity",
"biolink:PhysicalEssenceOrOccurrent"
],
MESH:D052638 (particulate matter)
"type": [
"biolink:ComplexMolecularMixture",
"biolink:ChemicalMixture",
"biolink:ChemicalEntity",
"biolink:PhysicalEssence",
"biolink:ChemicalOrDrugOrTreatment",
"biolink:ChemicalEntityOrGeneOrGeneProduct",
"biolink:ChemicalEntityOrProteinOrPolypeptide",
"biolink:NamedThing",
"biolink:Entity",
"biolink:PhysicalEssenceOrOccurrent"
]
PUBCHEM.COMPOUND:241 (benzene)
"type": [
"biolink:SmallMolecule",
"biolink:MolecularEntity",
"biolink:ChemicalEntity",
"biolink:PhysicalEssence",
"biolink:ChemicalOrDrugOrTreatment",
"biolink:ChemicalEntityOrGeneOrGeneProduct",
"biolink:ChemicalEntityOrProteinOrPolypeptide",
"biolink:NamedThing",
"biolink:Entity",
"biolink:PhysicalEssenceOrOccurrent"
],
If I change the search terms by adding "exposure" for the last two variables above, here's what Node Norm outputs:
UMLS:C2136615 (airborne pollutant exposure)
"type": [
"biolink:PhenotypicFeature",
"biolink:DiseaseOrPhenotypicFeature",
"biolink:ThingWithTaxon",
"biolink:BiologicalEntity",
"biolink:NamedThing",
"biolink:Entity"
]
NCIT:C36251 (benzene exposure)
"type": [
"biolink:PhenotypicFeature",
"biolink:DiseaseOrPhenotypicFeature",
"biolink:ThingWithTaxon",
"biolink:BiologicalEntity",
"biolink:NamedThing",
"biolink:Entity"
],
So, Node Norm is now recognizing things like chemical exposures, BUT the mappings to biolink:ChemicalEntity are lost, AND the mappings to biolink:PhenotypicFeature seem a bit weird to me (especially when biolink:EnvironmentalExposure is an option) but are okay-ish.
Decision: (1) Add biolink:EnvironmentalExposure mappings to exposures that Node Norm returns. (2) Ask Biolink team about the mappings for chemical exposures (second set of examples above) and other types of exposures. (3) Address any downstream normalization issues with ICEES output when/if they arise.
Noting that the YAML files contain a number of Biolink mappings that are not supported by Node Norm. For instance:
{
"UMLS:C0019993": {
"id": {
"identifier": "UMLS:C0019993",
"label": "Hospitalization"
},
"equivalent_identifiers": [
{
"identifier": "UMLS:C0019993",
"label": "Hospitalization"
}
],
"type": [
"biolink:Activity",
"biolink:ActivityAndBehavior",
"biolink:NamedThing",
"biolink:Entity",
"biolink:Occurrent",
"biolink:PhysicalEssenceOrOccurrent"
]
},
"": null
}
I mapped "hospitalization" to biolink:ClinicalIntervention, which seems more appropriate than the Node Norm mappings that are returned.
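For anyone comparing the YAML mappings against Node Norm output like the above, a minimal sketch of pulling the categories out of a response of that shape might look like this. `extract_categories` is a hypothetical helper, and the sample is abbreviated from the output shown; note the empty key with a null value, which has to be skipped.

```python
import json
from typing import Dict, List, Optional


def extract_categories(response: Dict[str, Optional[dict]]) -> Dict[str, List[str]]:
    """Map each CURIE in a Node Norm style response to its "type" list."""
    categories = {}
    for curie, entry in response.items():
        if not curie or entry is None:
            # Skip empty keys and unrecognized CURIEs (null entries).
            continue
        categories[curie] = list(entry.get("type", []))
    return categories


# Abbreviated copy of the response shown above.
sample = json.loads("""
{
  "UMLS:C0019993": {
    "id": {"identifier": "UMLS:C0019993", "label": "Hospitalization"},
    "type": ["biolink:Activity", "biolink:NamedThing", "biolink:Entity"]
  },
  "": null
}
""")

node_norm_categories = extract_categories(sample)
has_intervention = "biolink:ClinicalIntervention" in node_norm_categories["UMLS:C0019993"]
```

Here `has_intervention` is false, which matches the observation that Node Norm does not return the category I chose for this variable.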
Updated decision / action items [assigned to Kara]:
[X] Supplement Node Norm Biolink category mappings with hand-curated mappings, defined within the all_features YAML files, which are more appropriate for certain ICEES KG variables.
[X] Create a PR to merge the new YAML files after first validating them.
[X] Post a ticket to the Biolink team in order to solicit their expert opinion on questionable Node Norm mappings. See https://github.com/biolink/biolink-model/issues/1156.
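A rough sketch of what "supplementing" could mean in code, under the assumption that hand-curated categories are listed first and Node Norm categories are appended without duplicates. `supplement_categories` is a hypothetical name, and the ordering is my assumption rather than anything ICEES KG requires.

```python
from typing import Iterable, List


def supplement_categories(node_norm: Iterable[str], curated: Iterable[str]) -> List[str]:
    """Combine curated and Node Norm categories, curated first, no duplicates."""
    merged: List[str] = []
    for category in list(curated) + list(node_norm):
        if category not in merged:
            merged.append(category)
    return merged


# "Hospitalization": curated category plus an abbreviated Node Norm list.
hospitalization = supplement_categories(
    node_norm=["biolink:Activity", "biolink:NamedThing", "biolink:Entity"],
    curated=["biolink:ClinicalIntervention"],
)

# A redundant curated mapping is harmless: it appears only once.
benzene = supplement_categories(
    node_norm=["biolink:SmallMolecule", "biolink:ChemicalEntity"],
    curated=["biolink:ChemicalEntity", "biolink:EnvironmentalExposure"],
)
```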
Notes on supplemental Biolink mappings:
biolink:ChemicalEntity and biolink:EnvironmentalExposure. The first mapping is redundant with what Node Norm will return, but I think that's okay, as it provides a record of how I mapped prior to splitting ICEES into ICEES+ and ICEES KG and leveraging SRI services, rather than human curation, to provide the Biolink mappings for ICEES KG.
biolink:ComplexChemicalMixture and biolink:EnvironmentalExposure.
biolink:EnvironmentalExposure.
biolink:ClinicalIntervention.
This issue is to formally report a disconnect between the Biolink mappings that are included in the ICEES API all_features config files and those that support ICEES KG, as reported in the meta-KG. The approach that we've implemented to automate some of the work and leverage SRI services is not picking up certain intended Biolink mappings. For instance, AvgDailyPM2.5Exposure should map to biolink:ChemicalEntity and biolink:EnvironmentalExposure. To provide another example, TotalEDVisits should map to biolink:ClinicalIntervention.
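The disconnect could be checked mechanically with something like the sketch below. The intended mappings are the two examples above; the `reported` values are made up for illustration and are not actual meta-KG output, and `missing_mappings` is a hypothetical helper.

```python
from typing import Dict, List


def missing_mappings(
    intended: Dict[str, List[str]], reported: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """List, per feature, the intended categories absent from the reported ones."""
    report = {}
    for feature, categories in intended.items():
        missing = [c for c in categories if c not in reported.get(feature, [])]
        if missing:
            report[feature] = missing
    return report


# Intended mappings, taken from the two examples in this issue.
intended = {
    "AvgDailyPM2.5Exposure": ["biolink:ChemicalEntity", "biolink:EnvironmentalExposure"],
    "TotalEDVisits": ["biolink:ClinicalIntervention"],
}

# Illustrative placeholder for what the meta-KG reports (not real output).
reported = {
    "AvgDailyPM2.5Exposure": ["biolink:ChemicalEntity"],
    "TotalEDVisits": ["biolink:Activity"],
}

disconnect = missing_mappings(intended, reported)
```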