Port the logic of the trait mapping pipeline

tskir commented 4 years ago

This is a follow-up ticket for #30. The "trait mapping pipeline" in the existing repository uses ZOOMA and OxO to attempt to automatically map trait names to ontology terms, and outputs two files:

Automatic mappings, which are considered of good quality and not manually reviewed (under current approach, ever)
Mappings requiring manual curation, which are reviewed using spreadsheets.

The logic of that pipeline needs to be ported into the web app as well. One way this could work is that if a mapping is considered of good quality, it can be applied to the trait automatically, and its status can be moved directly into review (skipping the manual mapping part).

tskir commented 4 years ago

Clarifications about the scope of this ticket:

The only goal is to classify some mappings as "good quality" and move them straight into review, skipping the unmapped status.
To implement that, all functionality from the existing trait mapping pipeline will need to be ported, including:
- Fetching ZOOMA's fields which store confidence of the mapping
- Querying OxO to discover more mappings through cross-ontology links
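
For illustration, the triage rule described above could be sketched as follows. This is only a sketch: the `Status` names and the `triage` function are hypothetical and not taken from the app's models, and it assumes that only unmapped traits are ever touched and that ZOOMA confidence arrives as a plain string such as 'HIGH'.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names; the app's real ones may differ.
    UNMAPPED = "unmapped"
    AWAITING_REVIEW = "awaiting_review"


def triage(current_status, suggestion_confidences):
    """Return the status a trait should have after automatic mapping.

    Traits that are not unmapped are left alone. An unmapped trait skips
    the manual mapping step when at least one suggestion is 'HIGH'.
    """
    if current_status is not Status.UNMAPPED:
        return current_status
    if any(c.upper() == "HIGH" for c in suggestion_confidences):
        return Status.AWAITING_REVIEW
    return Status.UNMAPPED
```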

joj0s commented 4 years ago

I am not really sure I understand the OxO query part. What do I query against, and how will the mappings found help in the mapping functionality?

Also since I am getting ZOOMA confidence information as part of this issue, should we include another field in the MappingSuggestion model to hold that, or should I just check the mapping confidence during the process of ZOOMA queries, and then discard that field?

tskir commented 4 years ago

I am not really sure I understand the OxO query part. What do I query against, and how will the mappings found help in the mapping functionality?

Honestly, I'm not really sure I completely understand OxO myself :) Its use predates my involvement in this project by a long shot. Basically it's just another service in the SPOT stack. Much like ZOOMA and OLS, it has a web interface which you can explore to get a sense of what it's doing: https://www.ebi.ac.uk/spot/oxo/, and the corresponding APIs.

Briefly, the idea of ZOOMA is to map (trait names) to (ontology terms from some ontologies), and the idea of OxO is to query (the ontologies which ZOOMA returned) and get (more ontologies which are cross-referenced from the query ones).

For a concrete example, imagine ZOOMA provides a mapping from "some disease description" to UO:123456, where UO is some ontology useless to us, as it's neither EFO nor something which can be imported into EFO. However, the UO:123456 term might include some cross-links to ontologies which are useful to us. So we query OxO with "UO:123456" and hope it returns something like "EFO:111111" or "MONDO:222222" as cross-links.

I think in the current code OxO is only called when ZOOMA returned no useful results, or something like that. This logic should be ported verbatim for now.

tskir commented 4 years ago

Oh, actually, I'm not sure my description is entirely correct. It may be that we're not running OxO on ZOOMA's results, but only querying the ontology terms which are already specified in ClinVar. You see, ClinVar data actually includes some ontologies already cross-linked for some terms, although they are for the most part from the OMIM and NCIT ontologies, which are not directly useful for us. So it may be that OxO is only used for those cases. Could you investigate and see which it is?

tskir commented 4 years ago

Also since I am getting ZOOMA confidence information as part of this issue, should we include another field in the MappingSuggestion model to hold that, or should I just check the mapping confidence during the process of ZOOMA queries, and then discard that field?

Good question. The mapping quality field is of limited use for a curator, so I say use it to make a decision and then discard. At least for now.

joj0s commented 4 years ago

# Process a single trait. Find any mappings in Zooma. If there are no high confidence Zooma mappings that are in EFO then query OxO with any high confidence mappings not in EFO.

This is what the current trait mapping pipeline in the Open Targets repository does. Should we follow the same procedure?

tskir commented 4 years ago

This is what the current trait mapping pipeline in the Open Targets repository does. Should we follow the same procedure?

Ah, I see, so I was right the first time around with doing post-processing for ZOOMA results using OxO. Yes, for now let's follow this same procedure
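
To make the agreed procedure concrete, here is a rough sketch of the control flow from the quoted pipeline comment. The three callables are hypothetical stand-ins for the real ZOOMA query, OxO query, and EFO membership check; they are injected so that the flow itself can be read (and tested) without network access.

```python
def process_trait(trait_name, query_zooma, query_oxo, is_in_efo):
    """Follow the existing pipeline's rule for a single trait.

    query_zooma(trait_name) -> list of (term_id, confidence) pairs
    query_oxo(term_ids)     -> list of cross-referenced term ids
    is_in_efo(term_id)      -> bool
    All three are assumptions made for this sketch.
    """
    high = [term for term, confidence in query_zooma(trait_name)
            if confidence == "HIGH"]

    # High confidence mappings already in EFO are used directly.
    in_efo = [term for term in high if is_in_efo(term)]
    if in_efo:
        return {"zooma": in_efo, "oxo": []}

    # Otherwise, try to reach EFO through cross-references.
    not_in_efo = [term for term in high if not is_in_efo(term)]
    oxo_terms = query_oxo(not_in_efo) if not_in_efo else []
    return {"zooma": [], "oxo": oxo_terms}
```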

joj0s commented 4 years ago

@tskir So this just occurred to me while working with ZOOMA and OxO for this issue. The filters I am applying to ZOOMA queries right now are the following: required:[none],ontologies:["efo,mondo,ordo,hp"] .

The required:[none] part means that ZOOMA skips searching through its already curated datasources and searches straight through OLS to find possible mappings. By playing with it a bit more, if I don't provide that, it never goes through the process of querying OLS for the ontologies specified. What we have now is what we want right?

And I am asking because I am now working on the process of querying for all ontologies if there are no 'HIGH' confidence mappings in the appropriate ontologies, and then finding cross refs through OxO. The problem is that the majority of 'HIGH' mappings come from curated datasources, not OLS. So should the second query be made to curated datasources and then run the results through OxO?

tskir commented 4 years ago

Just to verify that I got this right: what you're saying is that if we specify required:[none] and force ZOOMA to never rely on curated datasources, it will almost always not return HIGH quality mappings, right?

Those are all good questions, and I think it would help to run this process for all (or a large number of) traits and see what are the actual distributions of the results we're getting with different combinations of flags. Could you please either do it in this issue, or create a separate one?

Also, could you remind me where we got the required:[none] part from? Is it used this way in the existing code from the eva-opentargets (formerly eva-cttv-pipeline) repository? I think an optimal approach would be to still allow ZOOMA to use the curated datasources, except the ones we provide ourselves (clinvar-xrefs and eva-clinvar), to avoid circular dependencies.

joj0s commented 4 years ago

Just to verify that I got this right: what you're saying is that if we specify required:[none] and force ZOOMA to never rely on curated datasources, it will almost always not return HIGH quality mappings, right?

As far as I have tested yes, I haven't found 'HIGH' quality mappings, although I will need to run that with a bigger trait list locally and follow up on that.

Is it used this way in the existing code from the eva-opentargets (formerly eva-cttv-pipeline) repository?

The current zooma request code accepts user provided filters for this attribute and I haven't yet found which filters are used when these requests are made. I also agree that I should add these datasources.

tskir commented 4 years ago

Got it. About this part:

I haven't yet found which filters are used when these requests are made

The default values are being specified using an arg parser in the wrapper script, here's where it's done: https://github.com/EBIvariation/eva-opentargets/blob/master/bin/trait_mapping.py#L32

joj0s commented 4 years ago

In the case when we find a 'HIGH' confidence mapping, but that mapping already exists in the database (and it is not the trait's current mapping), what should we do with it? Should we map it again or should we just skip that automatic mapping and find other ones?

Or maybe we should skip the process of trying to find automatic mappings altogether if a trait is currently unmapped but has a past mapping? I am already skipping this process if a trait is not unmapped of course.

joj0s commented 4 years ago

I've also encountered this case: https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate?propertyValue=Congenital+muscular+dystrophy-dystroglycanopathy+with+mental+retardation,+type+B1&filter=required:[cttv,sysmicro,atlas,ebisc,uniprot,gwas,cbi]

Notice how it returns only one suggestion object but with two IRIs. I'm not sure why that is and how I should treat it.

joj0s commented 4 years ago

I am also having trouble making successful requests to the OxO API. At first I tried looking through the documentation but most of it is broken as it seems.

Then I tried following the mapping pipeline https://github.com/EBIvariation/eva-opentargets/blob/aae028daacc0dd98026fb4b446afe7e869e1f791/eva_cttv_pipeline/trait_mapping/oxo.py#L138 and the default arguments provided in https://github.com/EBIvariation/eva-opentargets/blob/aae028daacc0dd98026fb4b446afe7e869e1f791/bin/trait_mapping.py#L38, but I keep getting 'Bad request' errors.

Could you provide an example request so that I can figure that out?

tskir commented 4 years ago

In the case when we find a 'HIGH' confidence mapping, but that mapping already exists in the database (and it is not the trait's current mapping), what should we do with it? Should we map it again or should we just skip that automatic mapping and find other ones?

Or maybe we should skip the process of trying to find automatic mappings altogether if a trait is currently unmapped but has a past mapping?

Since there can be several scenarios which can lead to this case, I'm thinking we shouldn't apply any special logic and, provided that the trait is currently unmapped, just automatically map it to this high quality suggestion as usual. We might change this logic in the future based on usability testing on real data, but not right now.

I've also encountered this case [...] Notice how it returns only one suggestion object but with two IRIs. I'm not sure why that is and how I should treat it.

Oh... Right... This is one of a few very rare cases where a trait was deliberately mapped to two different ontology terms, which kind of combine and together make up that one trait. We discussed this with Open Targets many months ago. What you should do at this time is skip such cases and log them with a WARN level. I'll escalate this with Open Targets to see if we could get rid of such situations, which in my opinion we absolutely should. Created issue for that: #68.
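
A possible shape for that skip-and-log step is sketched below. It assumes each ZOOMA result is a dict with a `semanticTags` list of IRIs and a `confidence` string; those key names and the helper name `single_iri_suggestions` are assumptions for this sketch and should be checked against the actual response.

```python
import logging

logger = logging.getLogger(__name__)


def single_iri_suggestions(zooma_results):
    """Yield (iri, confidence) for results that carry exactly one IRI.

    Results with several IRIs (one trait deliberately mapped to a
    combination of terms) are skipped and logged at WARN level.
    """
    for result in zooma_results:
        iris = result.get("semanticTags", [])
        if len(iris) != 1:
            logger.warning(
                "Skipping ZOOMA suggestion with %d IRIs: %s", len(iris), iris
            )
            continue
        yield iris[0], result.get("confidence")
```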

I am also having trouble making successful requests to the OxO API. At first I tried looking through the documentation but most of it is broken as it seems. [...] Could you provide an example request so that I can figure that out?

Before we dive into any further investigation, can you make sure that you're doing POST requests? They are the only ones which will work for search queries, OxO will not accept GET for that purpose.

joj0s commented 4 years ago

Before we dive into any further investigation, can you make sure that you're doing POST requests? They are the only ones which will work for search queries, OxO will not accept GET for that purpose.

Yes, I am doing POST requests and providing the arguments as part of the JSON payload

tskir commented 4 years ago

Turns out documentation being partially broken is a known issue, they'll fix it in the next update: https://github.com/EBISPOT/OXO/issues/35

Regarding the requests to OxO, something like this works for me:

import json
import requests
# OxO only accepts POST for search queries.
url = 'https://www.ebi.ac.uk/spot/oxo/api/search?size=5000'
payload = {
    'ids': ['MESH:D009202'],  # CURIE-like identifiers, not full IRIs
    'mappingTarget': 'Orphanet,efo,hp,mondo',
    'distance': 3,
}
result = requests.post(url, data=payload)
print(json.dumps(result.json(), indent=4, sort_keys=True))


```json
{
    "_embedded": {
        "searchResults": [
            {
                "_links": {
                    "mappings": {"href": "https://www.ebi.ac.uk/spot/oxo/api/mappings?fromId=MeSH:D009202"},
                    "self": {"href": "https://www.ebi.ac.uk/spot/oxo/api/terms/MeSH:D009202"}
                },
                "curie": "MeSH:D009202",
                "label": "Cardiomyopathies",
                "mappingResponseList": [
                    {"curie": "MONDO:0005110", "distance": 3, "label": "idiopathic cardiomyopathy", "sourcePrefixes": ["ONTONEO", "BAO", "MONDO", "DOID", "UMLS"], "targetPrefix": "MONDO"},
                    {"curie": "MONDO:0021570", "distance": 3, "label": "Hauptmann-Thannhauser muscular dystrophy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "MONDO:0005045", "distance": 3, "label": "hypertrophic cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "MONDO:0016587", "distance": 3, "label": "arrhythmogenic right ventricular cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "EFO:0000407", "distance": 3, "label": "dilated cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "EFO"},
                    {"curie": "MONDO:0016340", "distance": 3, "label": "familial restrictive cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "EFO:0000538", "distance": 3, "label": "hypertrophic cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "EFO"},
                    {"curie": "MONDO:0010542", "distance": 3, "label": "dilated cardiomyopathy 3B", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "Orphanet:217635", "distance": 3, "label": "Familial restrictive cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "Orphanet"},
                    {"curie": "EFO:0000318", "distance": 1, "label": "cardiomyopathy", "sourcePrefixes": ["EFO"], "targetPrefix": "EFO"},
                    {"curie": "EFO:0000767", "distance": 3, "label": "idiopathic cardiomyopathy", "sourcePrefixes": ["ONTONEO", "BAO", "EFO", "DOID", "UMLS"], "targetPrefix": "EFO"},
                    {"curie": "MONDO:0005021", "distance": 3, "label": "dilated cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "MONDO:0004994", "distance": 1, "label": "cardiomyopathy", "sourcePrefixes": ["MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "MONDO:0005217", "distance": 3, "label": "familial cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "HP:0001638", "distance": 1, "label": "Cardiomyopathy", "sourcePrefixes": ["HP"], "targetPrefix": "HP"},
                    {"curie": "Orphanet:167848", "distance": 1, "label": "Cardiomyopathy", "sourcePrefixes": ["Orphanet"], "targetPrefix": "Orphanet"},
                    {"curie": "EFO:0002630", "distance": 3, "label": "restrictive cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "EFO"},
                    {"curie": "MONDO:0005201", "distance": 3, "label": "restrictive cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "MONDO"},
                    {"curie": "EFO:0002945", "distance": 3, "label": "familial cardiomyopathy", "sourcePrefixes": ["EFO", "MONDO"], "targetPrefix": "EFO"}
                ],
                "queryId": "MESH:D009202",
                "querySource": null
            }
        ]
    },
    "_links": {"self": {"href": "https://www.ebi.ac.uk/spot/oxo/api/search"}},
    "page": {"number": 0, "size": 1000, "totalElements": 1, "totalPages": 1}
}
```

Most likely the problem is that you (quite understandably) try to query OxO with full IRIs. However, it can't take those. Instead, you have to convert the IRIs into the CURIE-like identifiers, like "MESH:D009202" in my example. This is where it's being done in the code.
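
For illustration, the IRI-to-CURIE conversion might look roughly like this. It is a sketch which assumes the identifier is the last segment of the IRI and has the form `PREFIX_LOCALID`; `iri_to_curie` is a hypothetical name, and the pipeline's own conversion may well handle more cases.

```python
import re

# Last IRI segment of the form PREFIX_LOCALID, e.g. "NCIT_C2985".
_SEGMENT = re.compile(r"([A-Za-z][A-Za-z0-9]*)_([A-Za-z0-9]+)")


def iri_to_curie(iri):
    """Convert a term IRI into a CURIE-like identifier for OxO.

    Returns None when the last segment does not look like PREFIX_LOCALID,
    so that callers can skip or log identifiers they cannot convert.
    """
    segment = re.split(r"[/#]", iri.rstrip("/"))[-1]
    match = _SEGMENT.fullmatch(segment)
    if match is None:
        return None
    prefix, local_id = match.groups()
    return f"{prefix}:{local_id}"
```

With this sketch, `iri_to_curie('http://purl.obolibrary.org/obo/NCIT_C2985')` gives `'NCIT:C2985'`, while an IRI whose last segment has no `PREFIX_` part gives `None`.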

joj0s commented 4 years ago

I see, thanks for the clarification. So I see this returns quite a few results. Should I add them all to the suggestion list? And then which one do I choose to automatically map?

tskir commented 4 years ago

We discussed this during the catch-up call on 2020-07-31, but I will repeat this here for the record:

We should decrease the mapping distance from 3 to 2, or even to 1. This will reduce the number of hits returned.
OxO results should not be used for doing the mappings automatically
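
The distance cut-off could also be applied on our side when reading the response. A rough sketch, assuming the response shape from the example output earlier in this thread (`_embedded.searchResults[].mappingResponseList[]`); `close_mappings` is a hypothetical helper name.

```python
def close_mappings(oxo_response, max_distance=2):
    """Collect (curie, label, distance) tuples from an OxO search response,
    keeping only mappings at most max_distance steps from the query term.

    The returned terms are meant to become suggestions for a curator,
    not automatic mappings.
    """
    hits = []
    results = oxo_response.get("_embedded", {}).get("searchResults", [])
    for result in results:
        for mapping in result.get("mappingResponseList", []):
            if mapping["distance"] <= max_distance:
                hits.append(
                    (mapping["curie"], mapping["label"], mapping["distance"])
                )
    return hits
```

Lowering `distance` in the request payload should narrow the results on the server side as well; filtering locally just makes the cut-off explicit in our own code.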

joj0s commented 4 years ago

I am kind of having trouble testing the OxO functionality. I am parsing 1000 ClinVar records now (as opposed to the usual 200) and I still can't find any trait that has a highly confident suggestion that is not in the appropriate ontologies. Do you happen to have any example of such a case?

tskir commented 4 years ago

Do you happen to have any example of such a case?

I don't have a specific example; but actually, when I think about this, it looks like the logic of the existing implementation (which we also copied into this project) is faulty:

We query ZOOMA and say we only want the results from the acceptable ontologies (EFO, Orphanet, MONDO, HP)
In the results, we try to find a suggested term which is not from those ontologies and then run it through OxO

Looking at this, it seems like our attempts to find such a term are doomed to fail. So maybe what we should do instead is run ZOOMA without ontology restrictions, then manually filter by the list of acceptable ontologies, and if no acceptable terms remain, run OxO on the results (which would now include non-acceptable ones).

joj0s commented 4 years ago

We query ZOOMA and say we only want the results from the acceptable ontologies

Actually this is not the case anymore. As we discussed, I am now making two ZOOMA queries, one for terms in manually curated sources from all ontologies, and one in OLS for the required ontologies. After all, the ontology filter only applies to OLS anyway. 'HIGH' confidence mappings always come from the first query, but all 'HIGH' confidence mappings I have found so far in 1000 ClinVar records are in the appropriate ontologies already.

tskir commented 4 years ago

After all, the ontology filter only applies to OLS anyway.

I don't think this is the case, though. Keep in mind that ZOOMA and OLS are two separate services, and when we put an ontologies:[efo,mondo,hp,ordo] filter into the ZOOMA request, then it should, at least in theory, limit ZOOMA's output to only those ontologies. This is the place in the code where this request is being made: https://github.com/EBIvariation/trait-curation/blob/532ef37e694fb401f32117b456881daba3fef81f/traitcuration/traits/datasources/zooma.py#L51

Could you try and experiment with it, remove the filter from the query, and see if it changes the output?

joj0s commented 4 years ago

&filter=required:[none],ontologies:[efo,mondo,hp,ordo]")

Well this is the problem we had and why we said we would make two queries instead of one. By setting the required filter to none, what we are doing is skipping the curated datasources entirely, and just searching in OLS for the defined ontologies.

To define the source(s) you want Zooma to search in, use the Database name in the 'required:[]' field. e.g. use 'required:[cttv]' to look into OpenTargets.

The 'ontologies:[none]' parameter will restrain Zooma from looking in the OLS if no annotation was found

These are taken from ZOOMA documentation. So basically, the required filter defines which curated sources to query, and the ontologies filter defines which ontologies to query only in OLS. Also keep in mind that if ZOOMA finds a mapping suggestion in a curated source, it doesn't query OLS at all.

In the updated pipeline, we are doing two queries. One with only the required filter set, using all sources except for ClinVar and ClinVar xrefs. And a second one with 'required' set to none, and 'ontologies' set to the appropriate ontology ids.
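
As a sketch, the two queries could be built like this. The source list is copied from the example request earlier in this thread; the helper names are hypothetical, and the sketch assumes ZOOMA decodes a percent-encoded `filter` parameter the same way as the literal form.

```python
from urllib.parse import urlencode

ZOOMA_ANNOTATE_URL = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"
# Curated sources, leaving out the ClinVar-derived ones.
CURATED_SOURCES = ["cttv", "sysmicro", "atlas", "ebisc", "uniprot", "gwas", "cbi"]
TARGET_ONTOLOGIES = ["efo", "mondo", "hp", "ordo"]


def build_zooma_url(trait_name, required=None, ontologies=None):
    """Build an annotate URL with an optional required/ontologies filter."""
    parts = []
    if required is not None:
        parts.append("required:[{}]".format(",".join(required)))
    if ontologies is not None:
        parts.append("ontologies:[{}]".format(",".join(ontologies)))
    query = {"propertyValue": trait_name}
    if parts:
        query["filter"] = ",".join(parts)
    return ZOOMA_ANNOTATE_URL + "?" + urlencode(query)


def two_query_urls(trait_name):
    """Return (curated_sources_url, ols_only_url) for one trait name."""
    curated = build_zooma_url(trait_name, required=CURATED_SOURCES)
    ols = build_zooma_url(
        trait_name, required=["none"], ontologies=TARGET_ONTOLOGIES
    )
    return curated, ols
```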

joj0s commented 4 years ago

So to come back to my original problem, I am having trouble testing the OxO functionality because the first query has never returned a 'high' confidence mapping that was not in a suitable ontology.

tskir commented 4 years ago

In the updated pipeline, we are doing two queries. One with only the required filter set, using all sources except for ClinVar and ClinVar xrefs. And a second one with 'required' set to none, and 'ontologies' set to the appropriate ontology ids.

Okay, right, now I think I get it. Your explanation in the comment above is really useful, could you add a markdown file somewhere in docs/ with this?

So to come back to my original problem, I am having trouble testing the OxO functionality because the first query has never returned a 'high' confidence mapping that was not in a suitable ontology.

I'm not entirely sure, but it might be because all curated sources which ZOOMA uses probably come from EBI or associated teams, and they probably all try to stay within EFO/MONDO/ORDO/HP list. This would explain why the first query against curated datasources doesn't return any results not from those ontologies. I see two solutions here:

We can enable the clinvar_xrefs datasource in the first query. This will almost exclusively contain NCIT and OMIM terms, taken directly from ClinVar, which would be high confidence, but are not acceptable ontologies. So this is where you could test the OxO functionality.
- Actually, we should probably include clinvar_xrefs in the list of sources for the first query permanently, not just for testing. The reason we exclude eva_clinvar is that it would create a circular dependency, where we submit the results of manual curation and use them again, coming from ZOOMA. However, the data in clinvar_xrefs comes directly from ClinVar data, not affected by our manual curation. So if we enable this dataset, then ZOOMA would occasionally return results from it, which we could then map using OxO to something from an acceptable ontology.
Also, we could simply remove the ontologies restriction from the second ZOOMA query, so that it returns all suitable terms found through OLS, and then we could also apply OxO in this case as well.

Let me know if something about this still doesn't make sense, I'll be happy to elaborate. Also, we can discuss this during today's call.

joj0s commented 4 years ago

We can enable the clinvar_xrefs datasource in the first query. This will almost exclusively contain NCIT and OMIM terms, taken directly from ClinVar, which would be high confidence, but are not acceptable ontologies. So this is where you could test the OxO functionality.

Sounds good, clinvar_xrefs does return a lot of 'HIGH' confidence mappings

Also, we could simply remove the ontologies restriction from the second ZOOMA query, so that it returns all suitable terms found through OLS, and then we could also apply OxO in this case as well.

I think this could make the suggestion list too extensive. Some OLS terms already seem pretty unrelated to their suggested trait names.

joj0s commented 4 years ago

I have enabled clinvar_xrefs and I ran into another problem. I got the term C2711754 suggested with high confidence, which points to https://www.ncbi.nlm.nih.gov/medgen/C2711754. I am not sure what ontology this belongs to. First of all, it doesn't follow the usual naming scheme of the other ontology terms, so I can't extract it automatically, and there is also no 'medgen' entry in OLS. I also tried querying both this IRI and the identifier against all OLS terms but got nothing back.

joj0s commented 4 years ago

I have now parsed over 5000 ClinVar records but I still can't find any non-MedGen terms with high confidence.

Maybe I should just include the rest of the automatic mapping pipeline plus the basic oxo module code in a separate PR so that we can have that functionality available until this is figured out?

joj0s commented 4 years ago

Update on this one. I used the following code to test the OxO functionality:

def test():
    t = Trait.objects.first()
    find_automatic_mapping(trait=t, created_terms=[], high_confidence_term_iris=['http://purl.obolibrary.org/obo/NCIT_C2985'])

Basically since I can't find any HIGH confidence mappings that are not in compatible ontologies, I provided a suggested NCIT term of my own for a trait. The app successfully creates new mapping suggestions based on OxO results, however I haven't yet been able to test it with the complete workflow of the app due to that reason. So should I open the PR with the complete automatic mapping functionality including the OxO code?

EBIvariation / trait-curation

Port the logic of the trait mapping pipeline #44