data is being overwritten during the mondo id resolving step

biothings / mydisease.info


data is being overwritten during the mondo id resolving step #43

Closed colleenXu closed 5 months ago

colleenXu commented 3 years ago

I noticed this in the HPO parser; it's unclear whether other parsers are affected. The HPO parser has disease-phenotype data, with diseases identified by OMIM, Orphanet, or DECIPHER IDs.

HPO annotations do not appear to have ID resolution, so the "same" disease can have different annotations under its OMIM ID than under its Orphanet or DECIPHER ID.

However, when the MONDO ID resolving step is done, only one ID's annotations are kept (the priority order is OMIM first, then Orphanet, then DECIPHER). This means the annotations from the other IDs are lost and missing from the API's output.
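To make the failure mode concrete, here is a minimal sketch of priority-based selection. It is not the actual mydisease.info code: the record layout, the IDs, and the function name `pick_by_priority` are all made up for illustration.

```python
# Hypothetical partially processed HPO records, keyed by source disease ID.
# The IDs and phenotype values below are placeholders, not real data.
hpo_records = {
    "OMIM:0000001": {"phenotype_related_to_disease": [{"hpo_id": "HP:0000001"}]},
    "ORPHANET:0001": {"phenotype_related_to_disease": [{"hpo_id": "HP:0000002"}]},
}


def pick_by_priority(xrefs, records):
    """Return annotations for the first ID found: OMIM, then Orphanet, then DECIPHER."""
    if xrefs.get("omim") in records:
        return records[xrefs["omim"]]
    elif xrefs.get("orphanet") in records:
        return records[xrefs["orphanet"]]
    elif xrefs.get("decipher") in records:
        return records[xrefs["decipher"]]
    else:
        return None


# One MONDO disease that maps to both an OMIM ID and an Orphanet ID.
xrefs = {"omim": "OMIM:0000001", "orphanet": "ORPHANET:0001"}
kept = pick_by_priority(xrefs, hpo_records)
# Only the OMIM annotations survive; the Orphanet ones never reach the output.
```

In this sketch the Orphanet record is never read once an OMIM match exists, which is the kind of data loss described above.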

andrewsu commented 3 years ago

@colleenXu can you provide an example please?

colleenXu commented 3 years ago

I've shown the missing data issue using this notebook to compare the partially-processed data to what's in mydisease right now.

The code that causes some of the data to be missed (keeping annotations from only one mapped ID) is probably the if-elif-else here

However, as shown in the last section of my notebook, the solution isn't as simple as merging the records, because each disease-phenotype annotation has different references, evidence type, biocuration, frequency, etc. included with it.
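One possible approach, sketched under assumptions: keep every annotation as its own entry and tag it with the ID it came from, rather than collapsing entries that share a phenotype. The field names `evidence`, `frequency`, and `source_id` and the function `merge_keep_all` are hypothetical, and this is not necessarily how the parser should (or does) handle it.

```python
def merge_keep_all(xrefs, records):
    """Concatenate annotations from every mapped ID, tagging each with its source ID."""
    merged = []
    for source in ("omim", "orphanet", "decipher"):
        disease_id = xrefs.get(source)
        record = records.get(disease_id, {})
        for annotation in record.get("phenotype_related_to_disease", []):
            # Copy the annotation so per-source details (evidence, frequency, ...) stay intact.
            merged.append({**annotation, "source_id": disease_id})
    return merged


# Placeholder data: the same phenotype annotated differently by two sources.
records = {
    "OMIM:0000001": {
        "phenotype_related_to_disease": [{"hpo_id": "HP:0000001", "evidence": "TAS"}]
    },
    "ORPHANET:0001": {
        "phenotype_related_to_disease": [
            {"hpo_id": "HP:0000001", "evidence": "TAS", "frequency": "HP:0040282"}
        ]
    },
}
xrefs = {"omim": "OMIM:0000001", "orphanet": "ORPHANET:0001"}
merged = merge_keep_all(xrefs, records)
```

Both entries for the shared phenotype are kept, so their differing metadata is not silently merged or dropped.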

colleenXu commented 8 months ago

@everaldorodrigo @andrewsu @newgene

This would be a useful issue to address, but I don't know if it's in-scope for Everaldo to work on

newgene commented 8 months ago

assigned to @DylanWelzel to confirm with @colleenXu if this is still an issue.

colleenXu commented 7 months ago

Update: confirmed with Dylan that this is an issue last Friday (2/16). Dylan is working on a fix, and we discussed it more on 2/22.

colleenXu commented 5 months ago

@DylanWelzel asked me to review and close this issue.

Based on our convos and a quick check of the deployed API, I think this has been successfully addressed. Here's an example: Temtamy syndrome in MyDisease

- now both the OMIM and Orphanet disease mappings show up (`hpo.omim` and `hpo.orphanet` fields)
- `phenotype_related_to_disease` now contains phenotypes from both the OMIM and Orphanet data
  - right now, there are 59 total = 34 from OMIM + 25 from Orphanet
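A check like this can be sketched as a small helper over an already-fetched record. The helper name `summarize_hpo` and the placeholder document are assumptions; only the field names `hpo.omim`, `hpo.orphanet`, and `phenotype_related_to_disease` come from the points above.

```python
def summarize_hpo(doc):
    """Report which source mappings are present and how many phenotype entries exist."""
    hpo = doc.get("hpo", {})
    phenotypes = hpo.get("phenotype_related_to_disease", [])
    return {
        "omim": hpo.get("omim"),
        "orphanet": hpo.get("orphanet"),
        "phenotype_count": len(phenotypes),
    }


# Placeholder document standing in for an API response; the IDs are not real.
doc = {
    "hpo": {
        "omim": "<omim id>",
        "orphanet": "<orphanet id>",
        "phenotype_related_to_disease": [
            {"hpo_id": f"HP:placeholder{i}"} for i in range(34 + 25)
        ],
    }
}
summary = summarize_hpo(doc)
```

With a real record, both mapping fields being present and the count matching the sum of the per-source counts would suggest nothing was overwritten.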

I also see the extra adjustments:

other fields (clinical_course / inheritance / clinical_modifier) have been adjusted to provide all info, similar to the pheno fields.

So I think this is good and I'm closing the issue.