EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

unique_association_fields duplicates #185

Closed ravwojdyla closed 3 years ago

ravwojdyla commented 3 years ago

Thanks for the great work! I am curious whether duplicates on unique_association_fields are expected. In the 20.11 OTG EVA evidence I can see 2452 of them, for example:

alleleOrigin  clinvarAccession  gene             phenotype                                  variant_id  
germline      RCV000763712      ENSG00000165478  http://www.orpha.net/ORDO/Orphanet_2478    rs746134081     3.0
              RCV000764382      ENSG00000135902  http://www.orpha.net/ORDO/Orphanet_590     rs563035306     3.0
              RCV000764381      ENSG00000135902  http://www.orpha.net/ORDO/Orphanet_590     rs142063328     3.0
              RCV000765367      ENSG00000108556  http://www.orpha.net/ORDO/Orphanet_590     rs757968612     3.0
              RCV000764380      ENSG00000135902  http://www.orpha.net/ORDO/Orphanet_590     rs201733876     3.0
                                                                                                           ... 
              RCV001080057      ENSG00000084754  http://www.orpha.net/ORDO/Orphanet_5       rs116320983     2.0
              RCV000630475      ENSG00000040531  http://www.orpha.net/ORDO/Orphanet_213     rs1555564051    2.0
              RCV001082778      ENSG00000215193  http://www.orpha.net/ORDO/Orphanet_79189   rs751507771     2.0
              RCV000691349      ENSG00000084754  http://www.orpha.net/ORDO/Orphanet_5       rs200715496     2.0
              RCV000875186      ENSG00000188037  http://www.orpha.net/ORDO/Orphanet_206973  rs148132102     2.0
Length: 2452, dtype: float64
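A count like the one above can be reproduced along these lines. This is a minimal sketch, assuming each evidence string is a dict carrying a `unique_association_fields` dict; the helper names `count_duplicate_keys` and `make_record` and the small sample are illustrative only and are not part of the pipeline.

```python
from collections import Counter


def count_duplicate_keys(records):
    """Count evidence strings per unique_association_fields key; keep keys seen more than once."""
    counts = Counter(
        # Sort the items so that dict ordering does not affect the key.
        tuple(sorted(record["unique_association_fields"].items()))
        for record in records
    )
    return {key: n for key, n in counts.items() if n > 1}


def make_record(accession, gene, phenotype, variant_id):
    """Build a stripped-down evidence string (hypothetical helper for this example)."""
    return {
        "unique_association_fields": {
            "alleleOrigin": "germline",
            "clinvarAccession": accession,
            "gene": gene,
            "phenotype": phenotype,
            "variant_id": variant_id,
        }
    }


# Illustrative sample: the first key appears three times, the second once.
records = [
    make_record("RCV000763712", "ENSG00000165478",
                "http://www.orpha.net/ORDO/Orphanet_2478", "rs746134081")
    for _ in range(3)
]
records.append(
    make_record("RCV000764382", "ENSG00000135902",
                "http://www.orpha.net/ORDO/Orphanet_590", "rs563035306")
)

duplicates = count_duplicate_keys(records)
for key, n in duplicates.items():
    print(dict(key)["clinvarAccession"], n)
```

Keys that occur only once are dropped, so an empty result means the evidence is unique on these fields.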

Looking into the differences between the duplicate entries, I see that they differ in the disease field. For example, for the "key" RCV000763712, ENSG00000165478, http://www.orpha.net/ORDO/Orphanet_2478, rs746134081, there are 3 duplicates that differ only in the disease:

({'id': 'http://www.orpha.net/ORDO/Orphanet_2478',
  'name': 'Megalencephalic leukoencephalopathy with subcortical cysts',
  'source_name': 'megalencephalic leukoencephalopathy with subcortical cysts 2a'},
 {'id': 'http://www.orpha.net/ORDO/Orphanet_2478',
  'name': 'Megalencephalic leukoencephalopathy with subcortical cysts',
  'source_name': 'megalencephalic leukoencephalopathy with subcortical cysts 1'},
 {'id': 'http://www.orpha.net/ORDO/Orphanet_2478',
  'name': 'Megalencephalic leukoencephalopathy with subcortical cysts',
  'source_name': 'megalencephalic leukoencephalopathy with subcortical cysts 2b, remitting, with or without mental retardation'})

They still have the same phenotype/id; only the source_name is different.
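One way to check which fields vary within a group of duplicates is sketched below. It assumes the duplicates are available as plain dicts; `differing_fields` is a hypothetical helper name, and the sample reuses the three disease entries shown above.

```python
def differing_fields(rows):
    """Return the set of keys whose values are not identical across all rows."""
    keys = set().union(*(row.keys() for row in rows))
    # repr() makes unhashable values (lists, dicts) comparable inside a set.
    return {key for key in keys if len({repr(row.get(key)) for row in rows}) > 1}


diseases = [
    {"id": "http://www.orpha.net/ORDO/Orphanet_2478",
     "name": "Megalencephalic leukoencephalopathy with subcortical cysts",
     "source_name": "megalencephalic leukoencephalopathy with subcortical cysts 2a"},
    {"id": "http://www.orpha.net/ORDO/Orphanet_2478",
     "name": "Megalencephalic leukoencephalopathy with subcortical cysts",
     "source_name": "megalencephalic leukoencephalopathy with subcortical cysts 1"},
    {"id": "http://www.orpha.net/ORDO/Orphanet_2478",
     "name": "Megalencephalic leukoencephalopathy with subcortical cysts",
     "source_name": "megalencephalic leukoencephalopathy with subcortical cysts 2b, "
                    "remitting, with or without mental retardation"},
]

differing = differing_fields(diseases)
print(differing)
```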

There also seem to be cases of complete duplicates. For example, for (alleleOrigin='germline', clinvarAccession='RCV000073296', gene='ENSG00000134982', phenotype='http://www.ebi.ac.uk/efo/EFO_0005842', variant_id='rs76039388'), there are 2 rows that seem to be exactly the same (there are more cases like that).

To sum up, there seem to be:

1. duplicates that differ only in the disease source_name
2. complete duplicates

Is this expected?

tskir commented 3 years ago

Hi, thanks a lot for this detailed report! The issue of having evidence string duplicates (in various ways) is a complex one, and something which we haven't completely figured out yet. Unfortunately I don't have the bandwidth to look into this in the near future, as I'm busy with other projects, but I will address your comments in detail as soon as I can. This will probably be some time in early 2021.

ravwojdyla commented 3 years ago

Hi @tskir, thank you! Appreciate any help/guidance (even if that requires a PR from me).

tskir commented 3 years ago

Hi @ravwojdyla! Thank you for your patience.

The upcoming Open Targets release 21.04 will not include the unique_association_fields attribute anymore due to changes in the JSON schema.

However, at the same time, all underlying problems which were causing evidence string duplication have been noted and fixed in our pipeline v2.0.0. I confirmed there are no duplicates in the generated data, and an upcoming PR will add more checks to ensure this always stays the case.

And to shed some light on what was happening, there were two kinds of issues, both related to string-to-ontology mapping:

1. Multiple disease names mapping to the same EFO term within an RCV record

Because the source disease name was not part of unique_association_fields, this was causing duplicates. Under the new approach, this will be grouped under the same evidence string. For example (this is one of the records which you reported):

{"alleleOrigins": ["unknown"], "datasourceId": "eva", "datatypeId": "genetic_association", "clinicalSignificances": ["uncertain significance"], "confidence": "criteria provided, single submitter", "studyId": "RCV000763712", "targetFromSourceId": "ENSG00000165478", "variantFunctionalConsequenceId": "SO_0001583", "variantId": "11_124923824_G_A", "variantRsId": "rs746134081", "cohortPhenotypes": ["Megalencephalic leukoencephalopathy with subcortical cysts 1", "Megalencephalic leukoencephalopathy with subcortical cysts 2a", "Megalencephalic leukoencephalopathy with subcortical cysts 2b, remitting, with or without mental retardation"], "diseaseFromSource": "Megalencephalic leukoencephalopathy with subcortical cysts 1", "diseaseFromSourceId": "CN034246", "diseaseFromSourceMappedId": "Orphanet_2478"}

Note the cohortPhenotypes field which includes the three original disease names mapped to the same EFO term.
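The grouping idea can be illustrated roughly as follows. This is only a sketch and not the pipeline's actual code: it assumes one input row per source disease name, groups on the RCV accession plus the mapped term, and leaves out all other evidence string fields. The function name `group_disease_names` is hypothetical.

```python
from collections import defaultdict


def group_disease_names(rows):
    """Collapse rows sharing an RCV accession and mapped term into one entry."""
    grouped = defaultdict(set)
    for row in rows:
        key = (row["studyId"], row["diseaseFromSourceMappedId"])
        grouped[key].add(row["diseaseFromSource"])
    return [
        {
            "studyId": study_id,
            "diseaseFromSourceMappedId": mapped_id,
            # Every source name that mapped to this term, in a stable order.
            "cohortPhenotypes": sorted(names),
        }
        for (study_id, mapped_id), names in grouped.items()
    ]


base = "Megalencephalic leukoencephalopathy with subcortical cysts "
suffixes = ["2a", "1", "2b, remitting, with or without mental retardation"]
rows = [
    {"studyId": "RCV000763712",
     "diseaseFromSourceMappedId": "Orphanet_2478",
     "diseaseFromSource": base + suffix}
    for suffix in suffixes
]

evidence = group_disease_names(rows)
print(len(evidence), evidence[0]["cohortPhenotypes"])
```

Three input rows yield a single entry whose cohortPhenotypes list holds all three names, which is the behaviour described above.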

2. Duplicate string-to-ontology mappings in the database

This happened when the “database” (currently implemented using Google Sheets and TSV files, but we're working on it) of string-to-ontology mappings accidentally included duplicate rows. This caused each evidence string containing the term to also become fully duplicated. This has been resolved, and likewise an upcoming PR will add a check in our SOP to always verify this is the case before the submission.
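A check of this kind could look like the sketch below. The two-column layout (disease name, ontology URI) is an assumption made for the example and may not match the real mapping files; `find_duplicate_rows` is a hypothetical name.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(tsv_file):
    """Return the rows that occur more than once in a tab-separated file object."""
    rows = (tuple(row) for row in csv.reader(tsv_file, delimiter="\t") if row)
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


# Illustrative mapping data with one accidental duplicate (first and last line).
uri = "http://www.orpha.net/ORDO/Orphanet_2478"
name = "megalencephalic leukoencephalopathy with subcortical cysts "
sample = io.StringIO(
    f"{name}1\t{uri}\n"
    f"{name}2a\t{uri}\n"
    f"{name}1\t{uri}\n"
)

duplicate_rows = find_duplicate_rows(sample)
print(duplicate_rows)
```

Running such a check before submission and failing on a non-empty result would catch the fully duplicated evidence strings at their source.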

Thank you very much for highlighting this back in December and providing such an insightful bug report! This issue will close once we have released version 2.0.1 with the minor fixes I'm currently working on; however, please feel free to submit any number of additional issues in the future should the need arise.

ravwojdyla commented 3 years ago

@tskir thanks for the update and the fixes! I understand that https://github.com/EBIvariation/eva-opentargets/issues/189 documents the schema change (or is there a better resource?).

> The upcoming Open Targets release 21.04 will not include the unique_association_fields attribute anymore due to changes in the JSON schema.

Given that unique_association_fields is gone, do you guarantee uniqueness of any sort? And how can that be asserted?

tskir commented 3 years ago

@ravwojdyla Yes, #189 is probably the best place which describes the changes specific to this pipeline. You may also wish to look at the Open Targets JSON schema release history: https://github.com/opentargets/json_schema/releases.

Regarding uniqueness, given the way the pipeline operates, each evidence string is guaranteed to be unique on the combination of four fields:

ravwojdyla commented 3 years ago

@tskir thank you! Looking forward to getting my hands on the upcoming release. And thanks for https://github.com/EBIvariation/eva-opentargets/issues/203!