ExposuresProvider / icees-api

MIT License

ICEES KG - feature variables that map to same CURIEs #237

Open karafecho opened 2 years ago

karafecho commented 2 years ago

This issue relates to the fact that ICEES KG contains feature variables that map to the same CURIEs. For instance, AsthmaDx (1+ diagnoses of asthma over study period), D28D_ASTHMA_ER_VISIT_12M (asthma at time of survey), and D28D_ASTHMA_ER_VISIT_12M (ER visits for asthma, with asthma as core concept and ER visits as qualifier) all map to MONDO:0004979. Yet the query biolink:ChemicalEntity related_to biolink:Disease (MONDO:0004979) only returns results for D28D_ASTHMA_ER_VISIT_12M.

The issue relates in part to how clinical feature variables are represented in Translator/Biolink. In the example above, 'asthma' is the core concept for all three feature variables, which are distinguished primarily by qualifiers. However, ICEES KG should be able to return results for any feature variables with 'asthma' as the core concept in the query graph.

karafecho commented 2 years ago

Note this issue probably should have been posted here: https://github.com/ExposuresProvider/icees-kg.

maximusunc commented 2 years ago

I think I've finally gotten to the bottom of this. In the all_features_yaml_file, the name lookup of AsthmaDx is "asthma diagnosis", and when you run that through Name Resolver, MONDO:0004979 is not in the returned list of CURIEs. I don't believe this is an issue in ICEES KG.

karafecho commented 2 years ago

Hmmm. I tested the search terms against both Name Resolver and Node Normalizer. Also, the test queries that I ran used MONDO identifiers that worked previously.

I think this may actually be a Name Resolver issue? For instance, the search term I included for CysticFibrosisDx is "cystic fibrosis diagnosis". If you look up the CURIE(s) in Name Resolver, MONDO:0009061 is correctly returned.

FWIW, I attempted to include search terms that somewhat accurately reflect the intended semantics of each feature variable without returning a gazillion results in Name Resolver. I put a limit of 75 returns on "asthma diagnosis", but Name Resolver is returning far fewer than that right now. Do you think it's possible that something about the service changed?

Also, is it possible to see the CURIE(s) that are mapped to ICEES feature variables in ICEES KG? I'm thinking that might help with troubleshooting.

maximusunc commented 2 years ago

If Name Resolver was returning the correct MONDO before and now it isn't, we should probably open an issue there.

You can see all the mappings via the edges, as each one has a subject and object CURIE that map to the subject_feature_name and object_feature_name. Though looking at all the edges might be a little tedious. I could run a script locally and give you a file with all the mappings if you would prefer.
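For what it's worth, a script like that could be fairly small. Below is a rough sketch, not the actual ICEES KG tooling. It assumes each edge is available as a dict with `subject` and `object` keys holding the CURIEs (those two key names are a guess) alongside the `subject_feature_name` and `object_feature_name` fields mentioned above. The sample edges are made up for illustration, and `ExampleChemicalFeature` / `EXAMPLE:0000001` are placeholders rather than real ICEES values.

```python
from collections import defaultdict


def feature_curie_mappings(edges):
    """Collect feature name -> set of CURIEs from a list of edge dicts."""
    mappings = defaultdict(set)
    for edge in edges:
        # "subject"/"object" are assumed key names for the CURIE fields.
        for role in ("subject", "object"):
            curie = edge.get(role)
            name = edge.get(f"{role}_feature_name")
            if curie and name:
                mappings[name].add(curie)
    return dict(mappings)


def shared_curies(mappings):
    """Invert the mapping and keep only CURIEs used by more than one feature."""
    by_curie = defaultdict(set)
    for name, curies in mappings.items():
        for curie in curies:
            by_curie[curie].add(name)
    return {c: sorted(names) for c, names in by_curie.items() if len(names) > 1}


# Illustrative edges only; the chemical feature and its CURIE are placeholders.
sample_edges = [
    {"subject": "EXAMPLE:0000001", "subject_feature_name": "ExampleChemicalFeature",
     "object": "MONDO:0004979", "object_feature_name": "AsthmaDx"},
    {"subject": "EXAMPLE:0000001", "subject_feature_name": "ExampleChemicalFeature",
     "object": "MONDO:0004979", "object_feature_name": "D28D_ASTHMA_ER_VISIT_12M"},
    {"subject": "EXAMPLE:0000001", "subject_feature_name": "ExampleChemicalFeature",
     "object": "MONDO:0009061", "object_feature_name": "CysticFibrosisDx"},
]

mappings = feature_curie_mappings(sample_edges)
print(shared_curies(mappings))
```

The inverted view is the part that matters for this issue, since it lists every feature variable that shares a CURIE such as MONDO:0004979.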

karafecho commented 2 years ago

I just used Name Resolver to test whether the inclusion of "diagnosis" in several search terms returns the intended CURIEs. Apart from "asthma diagnosis", Name Resolver works as expected. For example, "idiopathic bronchiectasis diagnosis" returns the following:

{
  "MONDO:0018956": [
    "Idiopathic bronchiectasis (diagnosis)",
    "idiopathic bronchiectasis",
    "bronchiectasis idiopathic",
    "Idiopathic bronchiectasis",
    "Idiopathic bronchiectasis (disorder)"
  ]
}

MONDO includes "diagnosis" in their definitions for diseases, including asthma, so I'm pretty sure Name Resolver should be returning MONDO:0004979 in response to "asthma diagnosis", and I'm pretty sure it did previously. I'll post a ticket.
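In case it helps with the ticket, here is a minimal sketch of the check I did by hand. It assumes only that a Name Resolver response has the shape pasted above (a JSON object keyed by CURIE, each with a list of names); `find_missing_curies` is a hypothetical helper, and the script works on a saved response rather than calling the service.

```python
import json


def find_missing_curies(response, expected_curies):
    """Return the expected CURIEs that are absent from a saved lookup response."""
    return [curie for curie in expected_curies if curie not in response]


# Response for "idiopathic bronchiectasis diagnosis", as pasted in this thread.
saved_response = json.loads("""
{
  "MONDO:0018956": [
    "Idiopathic bronchiectasis (diagnosis)",
    "idiopathic bronchiectasis",
    "bronchiectasis idiopathic",
    "Idiopathic bronchiectasis",
    "Idiopathic bronchiectasis (disorder)"
  ]
}
""")

# The intended CURIE is present, so nothing is reported as missing.
print(find_missing_curies(saved_response, ["MONDO:0018956"]))

# A CURIE that is not in this particular response is reported back.
print(find_missing_curies(saved_response, ["MONDO:0018956", "MONDO:0004979"]))
```

Running the same check on a saved "asthma diagnosis" response against MONDO:0004979 would give a concrete before/after record if the service changes again.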

karafecho commented 2 years ago

WRT the identifier mappings, no need to write a script. I just thought there might be an existing way to readily review these.