Closed ravwojdyla closed 3 years ago
Hi, thanks a lot for this detailed report! The issue of evidence string duplicates (of various kinds) is a complex one, and something we haven't completely figured out yet. Unfortunately I don't have the bandwidth to look into this in the near future, as I'm busy with other projects, but I will address your comments in detail as soon as I can. This will probably be some time in early 2021.
Hi @tskir, thank you! Appreciate any help/guidance (even if that requires a PR from me).
Hi @ravwojdyla! Thank you for your patience.
The upcoming Open Targets release 21.04 will not include the `unique_association_fields` attribute anymore due to changes in the JSON schema.
However, at the same time, all underlying problems which were causing evidence string duplication have been noted and fixed in our pipeline v2.0.0. I confirmed there are no duplicates in the generated data, and an upcoming PR will add more checks to ensure this always stays the case.
And to shed some light on what was happening, there were two kinds of issues, both related to string-to-ontology mapping:
First, several source disease names can map to the same ontology term. Because the source disease name was not part of `unique_association_fields`, this was causing duplicates. Under the new approach, such records are grouped under the same evidence string. For example (this is one of the records which you reported):
{"alleleOrigins": ["unknown"], "datasourceId": "eva", "datatypeId": "genetic_association", "clinicalSignificances": ["uncertain significance"], "confidence": "criteria provided, single submitter", "studyId": "RCV000763712", "targetFromSourceId": "ENSG00000165478", "variantFunctionalConsequenceId": "SO_0001583", "variantId": "11_124923824_G_A", "variantRsId": "rs746134081", "cohortPhenotypes": ["Megalencephalic leukoencephalopathy with subcortical cysts 1", "Megalencephalic leukoencephalopathy with subcortical cysts 2a", "Megalencephalic leukoencephalopathy with subcortical cysts 2b, remitting, with or without mental retardation"], "diseaseFromSource": "Megalencephalic leukoencephalopathy with subcortical cysts 1", "diseaseFromSourceId": "CN034246", "diseaseFromSourceMappedId": "Orphanet_2478"}
Note the `cohortPhenotypes` field, which includes the three original disease names mapped to the same EFO term.
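As a rough illustration of the grouping described above (a sketch only, not the pipeline's actual code), the snippet below merges records that share the same study, target, consequence and mapped ontology term, and collects their source disease names into `cohortPhenotypes`. The function name `group_by_mapped_term` and the choice of key fields are assumptions made for this example; the field names are taken from the record quoted above.

```python
# Hypothetical sketch: merge records that differ only in the source disease name.
KEY_FIELDS = (
    "studyId",
    "targetFromSourceId",
    "variantFunctionalConsequenceId",
    "diseaseFromSourceMappedId",
)


def group_by_mapped_term(records):
    """Return one merged record per key, listing all source names in cohortPhenotypes."""
    grouped = {}
    for record in records:
        key = tuple(record[field] for field in KEY_FIELDS)
        if key not in grouped:
            # The first record seen for a key becomes the representative one.
            merged = dict(record)
            merged["cohortPhenotypes"] = []
            grouped[key] = merged
        names = grouped[key]["cohortPhenotypes"]
        if record["diseaseFromSource"] not in names:
            names.append(record["diseaseFromSource"])
    return list(grouped.values())


if __name__ == "__main__":
    source_names = [
        "Megalencephalic leukoencephalopathy with subcortical cysts 1",
        "Megalencephalic leukoencephalopathy with subcortical cysts 2a",
        "Megalencephalic leukoencephalopathy with subcortical cysts 2b, "
        "remitting, with or without mental retardation",
    ]
    records = [
        {
            "studyId": "RCV000763712",
            "targetFromSourceId": "ENSG00000165478",
            "variantFunctionalConsequenceId": "SO_0001583",
            "diseaseFromSourceMappedId": "Orphanet_2478",
            "diseaseFromSource": name,
        }
        for name in source_names
    ]
    merged = group_by_mapped_term(records)
    print(len(merged), len(merged[0]["cohortPhenotypes"]))  # prints: 1 3
```

The merged record keeps the first source name in `diseaseFromSource`, which matches the shape of the quoted record, though the real pipeline may pick the representative name differently.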
The second issue occurred when the “database” of string-to-ontology mappings (currently implemented using Google Sheets and TSV files, but we're working on it) accidentally included duplicate rows. This caused each evidence string containing the affected term to become fully duplicated as well. This has been resolved, and likewise an upcoming PR will add a check to our SOP to verify that no duplicate rows are present before each submission.
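A check for duplicate mapping rows could look roughly like the sketch below. It is only an illustration under assumptions: the function name `find_duplicate_rows` is hypothetical, and the two-column layout (source name, ontology term) in the example is a guess, not the actual format of the mappings file.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(tsv_text):
    """Return the rows (as tuples) that occur more than once in TSV text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(tuple(row) for row in reader if row)
    return [row for row, count in counts.items() if count > 1]


if __name__ == "__main__":
    # Assumed layout: source disease name <TAB> mapped ontology term.
    mappings = (
        "Megalencephalic leukoencephalopathy with subcortical cysts 1\tOrphanet_2478\n"
        "Megalencephalic leukoencephalopathy with subcortical cysts 2a\tOrphanet_2478\n"
        "Megalencephalic leukoencephalopathy with subcortical cysts 1\tOrphanet_2478\n"
    )
    for row in find_duplicate_rows(mappings):
        print("duplicate mapping row:", row)
```

Run before submission, an empty result would indicate that no mapping row is repeated verbatim.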
Thank you very much for highlighting this back in December and providing such an insightful bug report! This issue will close once we have released version 2.0.1 with the minor fixes I'm currently working on; however, please feel free to submit any number of additional issues in the future should the need arise.
@tskir thanks for the update and the fixes! I understand the https://github.com/EBIvariation/eva-opentargets/issues/189 is the documentation of the schema change (or is there a better resource?).
> The upcoming Open Targets release 21.04 will not include the `unique_association_fields` attribute anymore due to changes in the JSON schema.

Given that `unique_association_fields` is gone, do you guarantee uniqueness of any sort? And how can that be asserted?
@ravwojdyla Yes, #189 is probably the best place which describes the changes specific to this pipeline. You may also wish to look at the Open Targets JSON schema release history: https://github.com/opentargets/json_schema/releases.
Regarding uniqueness, given the way the pipeline operates, each evidence string is guaranteed to be unique on the combination of four fields:
- `studyId`, e.g. RCV000015714
- `targetFromSourceId`, e.g. ENSG00000186832
- `variantFunctionalConsequenceId`, e.g. SO_0001818
- `diseaseFromSourceMappedId`, e.g. Orphanet_2337

@tskir thank you! Looking forward to getting my hands on the upcoming release. And thanks for https://github.com/EBIvariation/eva-opentargets/issues/203!
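For anyone who wants to assert the four-field uniqueness described above on their own copy of the data, here is a minimal sketch. It assumes the evidence is stored as one JSON object per line; the function name `find_duplicate_keys` and the file name in the usage note are hypothetical.

```python
import json
from collections import Counter

# The four fields named in the comment above.
UNIQUE_FIELDS = (
    "studyId",
    "targetFromSourceId",
    "variantFunctionalConsequenceId",
    "diseaseFromSourceMappedId",
)


def find_duplicate_keys(lines):
    """Map each four-field key that occurs more than once to its count.

    `lines` is any iterable of strings, each holding one JSON evidence string.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[tuple(record.get(field) for field in UNIQUE_FIELDS)] += 1
    return {key: count for key, count in counts.items() if count > 1}
```

Usage could be as simple as opening the evidence file (for example a hypothetical `evidence.json`), passing the file object to `find_duplicate_keys`, and asserting that the returned dictionary is empty.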
Thanks for the great work! I am curious whether duplicates on `unique_association_fields` are expected. In the 20.11 OTG EVA evidence I can see 2452 of them, for example:

Looking into the differences between the duplicate entries, I see that they differ in the `disease` field. For example, for the "key" `RCV000763712, ENSG00000165478, http://www.orpha.net/ORDO/Orphanet_2478, rs746134081`, there are 3 duplicates that differ only in `disease`:

They are still the same phenotype/id; only the `source_name` is different.

There also seem to be cases of complete duplicates. For example, for `(alleleOrigin='germline', clinvarAccession='RCV000073296', gene='ENSG00000134982', phenotype='http://www.ebi.ac.uk/efo/EFO_0005842', variant_id='rs76039388')` there are 2 rows that seem to be exactly the same (there are more cases like that).

To sum up, there seem to be:
- duplicates on `unique_association_fields` that differ only in the `disease` field
- complete duplicates

Is this expected?