EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit

Apache License 2.0


Migrate to JSON schema v2 #189

Closed tskir closed 3 years ago

tskir commented 3 years ago

Linked to https://www.ebi.ac.uk/panda/jira/browse/EVA-2324. Reported by Kostas Tsirigos on 2021-02-03 via email:

Dear EVA team,

We are very happy to share with you our new JSON schema which should be used from the upcoming release (21.04 – April 2021) onwards. There are major changes in the schema. We encourage you to start the migration process as early as possible and raise any questions or problems you might encounter. You will find more details specific to you and your resource below. The last version of the JSON schema can be found here, while the validator to use alongside can be accessed here.

The necessity for a new, rewritten, JSON schema is part of our rewrite project, which will also come into effect with the 21.04 platform release (April 2021). The rationale behind this change is to streamline the evidence data-model across data-providers and improve the performance of our pipelines. The data you provide to us will now have a flatter format and hopefully it will require a simpler code-base on your side. We encourage you to read the schema documentation to understand what each field is about.

Some key points regarding the new JSON schema that we would like to draw your attention to are the following:

· You might notice that the new flatter schema has a lot of fields that are not yet available to your resource. If you think you have data for any of these fields, please contact us and we can make them available to you.

· JSON schema validator: we strongly encourage you to please indicate whether you have validated the evidence strings and with which version of the schema. Please try to use the latest version of the schema at all times. Also, let us know if you have any issues running the validator and we can assist you in the process.

· The unique_association_fields field does not exist in the JSON schema anymore. We will account for the uniqueness of the evidence based on some common policies. We hope this decision also simplifies the process on your side.

· We no longer support the target_id field. This term is reserved and will be added on our side. Instead, we expect you to provide targetFromSourceId, which corresponds to the internal identifier used by your resource to refer to genes/transcripts/proteins, whether it is an Ensembl gene identifier, a UniProt accession or a gene symbol approved by HGNC. The values you submit to us will subsequently be mapped to Ensembl gene identifiers using a common pipeline across datasources.

· In an analogous manner, the disease_id field is not available anymore. We encourage you to populate the field diseaseFromSourceMappedId with the result of your mapping/curation to EFO. Additionally, some other fields related to disease/phenotype/trait characterisation are available to you. diseaseFromSourceId can be populated with any internal ID your resource uses to refer to diseases/phenotypes/traits (e.g. MeSH, ICD, etc). Importantly, this field should not be the result of an additional mapping process, just the raw representation of the trait. Similarly, if a raw description of the disease/phenotype/trait is available, it can be provided in the field diseaseFromSource. Finally, if there is a set of phenotypes that characterise the set of individuals the evidence is based on, it can be provided in cohortPhenotypes.

· To illustrate the migration process, we provide below (also attached to the email) an example of how a JSON object from EVA looked before and after migrating it to the new schema.

Some new data we would like you to include in the JSON following the schema:

diseaseFromSourceId - this field shall include, if present, the disease ID captured at source by the EVA team during the curation process. We ignore whether there is any standardised disease ID that you use internally. If such an ID is not available, please ignore this comment.
cohortPhenotypes - as discussed in issue https://github.com/opentargets/platform/issues/1013, this field shall include an array with all the different phenotypes describing the RCV.
diseaseFromSource - as discussed in issue https://github.com/opentargets/platform/issues/1013, this field shall include the first phenotype from the cohortPhenotypes array in alphabetical order. While this decision might look arbitrary, we believe it is enough to have a string representation of the disease.
variantId - this field shall include an identifier of the variant using CHROM_POS_REF_ALT notation.
variantFunctionalConsequenceId - this field is currently absent in eva_somatic evidence. We would like to include the field similar to the eva datasource including the functional consequence of the variant.
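
The two derived fields requested above, variantId and diseaseFromSource, can be illustrated with a minimal Python sketch. The helper names `build_variant_id` and `pick_disease_from_source` are hypothetical, the coordinates in the usage note are made up, and the case-insensitive sort is an assumption, since the email only says "alphabetical order".

```python
from typing import List, Optional


def build_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Join variant coordinates using CHROM_POS_REF_ALT notation."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def pick_disease_from_source(cohort_phenotypes: List[str]) -> Optional[str]:
    """Return the alphabetically first phenotype, or None for an empty list.

    Assumption: sorting ignores case; the email does not specify this.
    """
    if not cohort_phenotypes:
        return None
    return sorted(cohort_phenotypes, key=str.casefold)[0]
```

For instance, `build_variant_id("13", 12345, "A", "G")` produces `13_12345_A_G`, and `pick_disease_from_source(["Disease B", "disease A"])` picks `disease A`.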

We have attempted to cover the issues opened in EVA and OT platform github trackers. However, there might still be pending issues to consider that might not be totally resolved by this work. For complex issues, we would encourage you to migrate the evidence first and we can iterate on further improvements at a later stage.

We are at your disposal to further discuss the schema, should you have any questions that we can resolve and/or wish to suggest fields (data) that can be included on top of the ones we already have.

Best regards,

Kostas Tsirigos on behalf of the Open Targets team

tskir commented 3 years ago

Before

EVA germline

{
  "type": "genetic_association",
  "access_level": "public",
  "sourceID": "eva",
  "variant": {
    "type": "snp single",
    "id": "http://identifiers.org/dbsnp/rs1021025464"
  },
  "validated_against_schema_version": "1.7.3",
  "disease": {
    "id": "http://www.orpha.net/ORDO/Orphanet_905",
    "source_name": "wilson disease",
    "name": "Wilson disease"
  },
  "target": {
    "target_type": "http://identifiers.org/cttv.target/gene_variant",
    "id": "http://identifiers.org/ensembl/ENSG00000123191"
  },
  "unique_association_fields": {
    "gene": "ENSG00000123191",
    "clinvarAccession": "RCV000626321",
    "phenotype": "http://www.orpha.net/ORDO/Orphanet_905",
    "alleleOrigin": "germline",
    "variant_id": "rs1021025464"
  },
  "evidence": {
    "variant2disease": {
      "is_associated": true,
      "evidence_codes": [
        "http://purl.obolibrary.org/obo/ECO_0000205"
      ],
      "urls": [
        {
          "nice_name": "Further details in ClinVar database",
          "url": "http://www.ncbi.nlm.nih.gov/clinvar/RCV000626321"
        }
      ],
      "provenance_type": {
        "database": {
          "dbxref": {
            "url": "http://identifiers.org/clinvar.record/RCV000626321",
            "id": "http://identifiers.org/clinvar",
            "version": "2017-08"
          },
          "id": "EVA",
          "version": "1.0"
        },
        "expert": {
          "statement": "Primary submitter of data",
          "status": true
        },
        "literature": {
          "references": [
            {
              "lit_id": "http://europepmc.org/abstract/MED/18506894"
            },
            {
              "lit_id": "http://europepmc.org/abstract/MED/20301685"
            },
            {
              "lit_id": "http://europepmc.org/abstract/MED/20482602"
            },
            {
              "lit_id": "http://europepmc.org/abstract/MED/27854360"
            }
          ]
        }
      },
      "resource_score": {
        "type": "pvalue",
        "method": {
          "description": "Not provided by data supplier"
        },
        "value": 1e-07
      },
      "unique_experiment_reference": "http://europepmc.org/abstract/MED/18506894",
      "date_asserted": "2020-06-21T23:00:00",
      "clinvar_rating": {
        "star_rating": 2,
        "review_status": "criteria provided, multiple submitters, no conflicts"
      },
      "last_evaluated_date": "2019-11-04T00:00:00",
      "clinical_significance": [
        "likely pathogenic"
      ],
      "mode_of_inheritance": [
        "Autosomal recessive inheritance"
      ]
    },
    "gene2variant": {
      "is_associated": true,
      "evidence_codes": [
        "http://identifiers.org/eco/cttv_mapping_pipeline"
      ],
      "urls": [
        {
          "nice_name": "Further details in ClinVar database",
          "url": "http://www.ncbi.nlm.nih.gov/clinvar/RCV000626321"
        }
      ],
      "provenance_type": {
        "database": {
          "dbxref": {
            "url": "http://identifiers.org/clinvar.record/RCV000626321",
            "id": "http://identifiers.org/clinvar",
            "version": "2017-08"
          },
          "id": "EVA",
          "version": "1.0"
        },
        "expert": {
          "statement": "Primary submitter of data",
          "status": true
        }
      },
      "functional_consequence": "http://purl.obolibrary.org/obo/SO_0001631"
    }
  },
  "literature": {
    "references": [
      {
        "lit_id": "http://europepmc.org/abstract/MED/18506894"
      },
      {
        "lit_id": "http://europepmc.org/abstract/MED/20301685"
      },
      {
        "lit_id": "http://europepmc.org/abstract/MED/20482602"
      },
      {
        "lit_id": "http://europepmc.org/abstract/MED/27854360"
      }
    ]
  }
}

EVA somatic

{
  "type": "somatic_mutation",
  "access_level": "public",
  "sourceID": "eva_somatic",
  "validated_against_schema_version": "1.7.3",
  "disease": {
    "id": "http://www.ebi.ac.uk/efo/EFO_0000095",
    "source_name": "chronic lymphocytic leukemia",
    "name": "chronic lymphocytic leukemia"
  },
  "target": {
    "target_type": "http://identifiers.org/cttv.target/gene_variant",
    "id": "http://identifiers.org/ensembl/ENSG00000157764"
  },
  "unique_association_fields": {
    "gene": "ENSG00000157764",
    "clinvarAccession": "RCV000417689",
    "phenotype": "http://www.ebi.ac.uk/efo/EFO_0000095",
    "alleleOrigin": "somatic",
    "variant_id": "rs397507484"
  },
  "evidence": {
    "is_associated": true,
    "evidence_codes": [
      "http://purl.obolibrary.org/obo/ECO_0000205"
    ],
    "known_mutations": [
      {
        "functional_consequence": "http://purl.obolibrary.org/obo/SO_0001583",
        "preferred_name": "missense_variant"
      }
    ],
    "urls": [
      {
        "nice_name": "Further details in ClinVar database",
        "url": "http://www.ncbi.nlm.nih.gov/clinvar/RCV000417689"
      }
    ],
    "provenance_type": {
      "database": {
        "dbxref": {
          "url": "http://identifiers.org/clinvar.record/RCV000417689",
          "id": "http://identifiers.org/clinvar",
          "version": "2017-08"
        },
        "id": "EVA",
        "version": "1.0"
      },
      "expert": {
        "statement": "Primary submitter of data",
        "status": true
      }
    },
    "resource_score": {
      "type": "probability",
      "value": 1
    },
    "date_asserted": "2020-06-25T23:00:00",
    "clinvar_rating": {
      "star_rating": 0,
      "review_status": "no assertion criteria provided"
    },
    "last_evaluated_date": "2016-05-30T23:00:00",
    "clinical_significance": [
      "likely pathogenic"
    ],
    "mode_of_inheritance": [
      "Somatic mutation"
    ]
  }
}

tskir commented 3 years ago

After

EVA germline

{
    "datasourceId" : "eva",
    "alleleOrigins" : [
        "germline"
    ],
    "allelicRequirements" : [
        "Autosomal recessive inheritance"
    ],
    "clinicalSignificances" : [
        "likely pathogenic"
    ],
    "confidence" : "criteria provided, multiple submitters, no conflicts",
    "datatypeId" : "genetic_association",
    "diseaseFromSource" : "wilson disease",
    "diseaseFromSourceMappedId" : "Orphanet_905",
    "literature" : [
        "18506894",
        "20301685",
        "20482602",
        "27854360"
    ],
    "studyId" : "RCV000492566",
    "targetFromSourceId" : "ENSG00000123191",
    "variantFunctionalConsequenceId" : "SO_0001631",
    "variantRsId" : "rs1021025464"
}

EVA somatic

{
    "datasourceId" : "eva_somatic",
    "alleleOrigins" : [
        "somatic"
    ],
    "allelicRequirements" : [
        "Somatic mutation"
    ],
    "clinicalSignificances" : [
        "likely pathogenic"
    ],
    "confidence" : "no assertion criteria provided",
    "datatypeId" : "somatic_mutation",
    "diseaseFromSource" : "chronic lymphocytic leukemia",
    "diseaseFromSourceMappedId" : "EFO_0000095",
    "studyId" : "RCV000417689",
    "variantRsId" : "rs397507484"
}
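
A rough Python sketch of how the germline "before" object could be flattened into the "after" shape is shown here. It is not the actual CMAT code: the function names `last_segment` and `convert_germline` are hypothetical, the input is a trimmed copy of the example above, and taking studyId from `clinvarAccession` is an assumption (the RCV accessions in the two examples above differ).

```python
def last_segment(uri: str) -> str:
    """Return the part of a URI after the final slash, e.g. an ontology ID."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def convert_germline(old: dict) -> dict:
    """Flatten a v1 germline evidence string into the v2 layout (sketch)."""
    unique = old["unique_association_fields"]
    v2d = old["evidence"]["variant2disease"]
    g2v = old["evidence"]["gene2variant"]
    return {
        "datasourceId": old["sourceID"],
        "alleleOrigins": [unique["alleleOrigin"]],
        "allelicRequirements": v2d["mode_of_inheritance"],
        "clinicalSignificances": v2d["clinical_significance"],
        "confidence": v2d["clinvar_rating"]["review_status"],
        "datatypeId": old["type"],
        "diseaseFromSource": old["disease"]["source_name"],
        "diseaseFromSourceMappedId": last_segment(old["disease"]["id"]),
        "literature": [
            last_segment(ref["lit_id"]) for ref in old["literature"]["references"]
        ],
        # Assumption: the RCV accession is used as the study identifier.
        "studyId": unique["clinvarAccession"],
        "targetFromSourceId": last_segment(old["target"]["id"]),
        "variantFunctionalConsequenceId": last_segment(g2v["functional_consequence"]),
        "variantRsId": unique["variant_id"],
    }


# Trimmed copy of the "before" germline example; unused fields are omitted.
old_record = {
    "type": "genetic_association",
    "sourceID": "eva",
    "disease": {
        "id": "http://www.orpha.net/ORDO/Orphanet_905",
        "source_name": "wilson disease",
    },
    "target": {"id": "http://identifiers.org/ensembl/ENSG00000123191"},
    "unique_association_fields": {
        "clinvarAccession": "RCV000626321",
        "alleleOrigin": "germline",
        "variant_id": "rs1021025464",
    },
    "evidence": {
        "variant2disease": {
            "clinvar_rating": {
                "review_status": "criteria provided, multiple submitters, no conflicts"
            },
            "clinical_significance": ["likely pathogenic"],
            "mode_of_inheritance": ["Autosomal recessive inheritance"],
        },
        "gene2variant": {
            "functional_consequence": "http://purl.obolibrary.org/obo/SO_0001631"
        },
    },
    "literature": {
        "references": [
            {"lit_id": "http://europepmc.org/abstract/MED/18506894"},
            {"lit_id": "http://europepmc.org/abstract/MED/20301685"},
        ]
    },
}

new_record = convert_germline(old_record)
```

Fields such as `evidence_codes`, `date_asserted` and `star_rating` have no counterpart in the flat object, so the sketch simply drops them.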

tskir commented 3 years ago

Questions and remarks

A. Allele origin attributes: `alleleOrigins`, `datasourceId`, `datatypeId`

1. Under the new schema, allele origin will be reported in three different fields, with their values correlating exactly with each other. Is this expected, or should we get rid of some of them?
   - alleleOrigins = germline / somatic
   - datasourceId = eva / eva_somatic
   - datatypeId = genetic_association / somatic_mutation
2. ClinVar records can have a number of different values for the “allele origin” field. How should we report values which are neither germline nor somatic? Currently this is being done using an old and unreliable heuristic which doesn't cover all options.
3. In addition, some records list both germline and somatic in that field. Currently in this case they are split into two evidence strings. Should we continue doing this or should we now group them under the same evidence string?

B. Association attributes: `allelicRequirements`, `clinicalSignificances`, `confidence`, `literature`, `studyId`

4. We used to report both star rating and review status. Under the new schema there is no field for the star rating, and it will not be reported. Is this expected?
5. ClinVar distinguishes between several types of literature references: disease oriented, target oriented, and “observed in” references. Under both the old and the new approaches, they are simply combined into a single list, but I just wanted to mention that if necessary, this could be reported in the future.

C. Phenotype attributes: `cohortPhenotypes`, `diseaseFromSource`, `diseaseFromSourceId`, `diseaseFromSourceMappedId`

6. Previously, when a ClinVar record contained multiple disease names, we used to generate one evidence string per name. Under the new schema, do I understand correctly that they should be grouped under the same evidence string?
7. Do I understand the meaning of all fields correctly?
   - cohortPhenotypes: a list of all disease name strings, as specified in ClinVar, in alphabetical order.
   - diseaseFromSource: the first string from that list.
   - diseaseFromSourceId: the original (ClinVar) ontology identifier corresponding to that first string.
   - diseaseFromSourceMappedId: the EFO term we mapped that first string to.
8. What to do in case a disease name string is mapped to multiple EFO terms? For example, “Coronary artery disease/myocardial infarction” is supposed to be mapped to two EFO terms as per our discussions. Under the old approach we used to generate an evidence string per mapping, should we continue doing that?
9. What to do in case a disease name string in ClinVar has multiple original ontology identifiers?
10. Previously, we used to report the disease name both in ClinVar and in EFO in two different fields. In the new schema, only the source names are reported. Is this expected?

D. Variant attributes: `targetFromSourceId`, `variantFunctionalConsequenceId`, `variantId`, `variantRsId`

11. Currently, in case a variant has an effect in multiple genes, we generate an evidence string per gene. Should we continue doing so?

E. General questions

12. Under the new schema, several fields which are currently reported will disappear. Please confirm this is expected:
    - All “*/evidence_codes” fields.
    - Variant type in target/target_type.
    - Date information in date_asserted and last_evaluated_date.

DSuveges commented 3 years ago

Hi Kirill,

Please find our answers below:

their values correlating exactly with each other. Is this expected, or should we get rid of some of them?

Your observation is correct, however, please keep them. We believe it will be developed later.

ClinVar records can have a number of different values for the “allele origin” field.

Please remove that "unreliable heuristic" function from the pipeline and provide the original annotation you have from ClinVar. (It also improves the annotation for alleleOrigins under 1.)

In addition, some records list both germline and somatic in that field.

While we handle somatic and germline variants differently, please just explode the evidence into two as it is now.

We used to report both star rating and review status. Under the new schema there is no field for the star rating, and it will not be reported.

Yes, it's expected: as there is a one-to-one correspondence between star rating and review status, the frontend will do this mapping. (It's a general effort across all sources and fields: if a value can be inferred, the frontend's logic will do the job to keep the data as slim as possible.)
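
To illustrate the kind of inference described here, this is a small Python sketch of a review status to star rating lookup. It is not the Open Targets frontend code. The table follows ClinVar's published review statuses as they stood at the time of this thread, to the best of my knowledge, and should be checked against the current ClinVar documentation. Only the two statuses appearing in the examples above (2 stars and 0 stars) are confirmed by this issue. Note that several statuses share a rating, so the lookup works from review status to stars but not the other way round.

```python
from typing import Optional

# Assumed lookup table; verify against ClinVar documentation before relying on it.
STARS_BY_REVIEW_STATUS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "criteria provided, conflicting interpretations": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def star_rating(review_status: str) -> Optional[int]:
    """Infer the star rating from the `confidence` string, or None if unknown."""
    return STARS_BY_REVIEW_STATUS.get(review_status.strip().lower())
```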

ClinVar distinguishes between several types of literature references: disease oriented, target oriented, and “observed in” references.

Thank you so much for pointing this out! We suspected there was some discrepancy in the literature, but we didn't know what the source was. We would like to keep only those references that support the target/disease evidence. Probably the observed_in category, but we are not 100% sure. Let's consider this association. The second ClinVar evidence, with one star (RCV001009374), is expected to have 16 literature references; however, if we check the ClinVar website, there's only one PubMed citation. So we would like to get that one reference.

do I understand correctly that they should be grouped under the same evidence string?

We would still like to see the evidence exploded into all mapped diseases.

Do I understand the meaning of all fields correctly?

We think one ClinVar entry might come with multiple disease terms. These terms are mapped to EFO. This mapping is potentially many-to-many: there might be multiple EFOs you can map, in which case please explode the evidence into all EFOs. But also many disease terms can be mapped to the same EFO. So:

- cohortPhenotypes: the list of all disease terms.
- diseaseFromSource: the first of those disease terms that are mapped to the EFO term.
- diseaseFromSourceId: the ClinVar disease ID of the term under diseaseFromSource, if available.
- diseaseFromSourceMappedId: the EFO ID which is mapped from the term diseaseFromSource.

What to do in case a disease name string is mapped to multiple EFO terms?

Please create as many evidence strings as there are EFO terms that can be mapped to the available disease terms for the ClinVar entry.

What to do in case a disease name string in ClinVar has multiple original ontology identifiers?

Try to keep the MedGen identifier.

In the new schema, only the source names are reported. Is this expected?

Yes. That's correct. As mentioned under 4, EFO/id is resolved on our side.

Currently, in case a variant has effect in multiple genes, we generate an evidence string per gene. Should we continue doing so?

Yes. Please keep exploding to all genes.

Please confirm this is expected:

Yes. That's expected. We are not using these values, however the evidence_code might come back later if there are multiple data sources from which this field would be useful.

If there are any further questions, or some of the explanations are not clear, please let us know!

tskir commented 3 years ago

@DSuveges Thanks a lot for your comments! Here are my replies, and we can discuss further in the upcoming meeting:

2. ClinVar records can have a number of different values for the “allele origin” field

Please remove that "unreliable heuristic" function from the pipeline and provide the original annotation you have from ClinVar.

This can be done but there are a couple of obstacles:

- alleleOrigins is currently an enum, and we can't pass arbitrary strings into it;
- The decision on the datasourceId and datatypeId fields depends on the strictly binary somatic/germline classification, and how to deduce them for other types is unclear.

5. ClinVar distinguishes between several types of literature references: disease oriented, target oriented, and “observed in” references

So we would like to keep only those references that support the target/disease evidence. Probably the observed_in category, but we are not 100% sure.

No problem, will do! Glad we've sorted this out. I'll look into “observed in” references to confirm that they're indeed what you want to get.

6–10. Phenotype attributes questions

Ah, I see, it all makes much more sense now. Let me just confirm my understanding using an example. Say we have these disease names and mappings in a single RCV:

| Disease string | EFO mapping |
| --- | --- |
| Disease A | EFO 1 |
| Disease B | EFO 1 |
| Disease C | EFO 1 |
| Disease D | EFO 2 + EFO 3 |
| Disease E | no mapping |

Am I right that you want this collection of evidence strings as the output?

| | Evidence string 1 | Evidence string 2 | Evidence string 3 |
| --- | --- | --- | --- |
| `cohortPhenotypes` | Disease A, B, C, D, E | Disease A, B, C, D, E | Disease A, B, C, D, E |
| `diseaseFromSource` | Disease A | Disease D | Disease D |
| `diseaseFromSourceId` | MedGen A | MedGen D | MedGen D |
| `diseaseFromSourceMappedId` | EFO 1 | EFO 2 | EFO 3 |

Disease E is not reported because of a lack of mapping.
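
The example above can be reproduced with a short Python sketch of the "explode per EFO term" logic. The function name `explode_by_efo` and the input shapes (a dict of disease name to EFO terms, and a dict of disease name to MedGen identifier) are assumptions made for illustration, not the CMAT data model.

```python
from typing import Dict, List


def explode_by_efo(
    efo_by_disease: Dict[str, List[str]],
    medgen_by_disease: Dict[str, str],
) -> List[dict]:
    """Build one evidence stub per distinct EFO term mapped within a single RCV."""
    cohort = sorted(efo_by_disease)  # every disease name, mapped or not

    # For each EFO term, remember the alphabetically first disease mapped to it.
    first_disease_for_efo: Dict[str, str] = {}
    for disease in cohort:
        for efo in efo_by_disease[disease]:
            first_disease_for_efo.setdefault(efo, disease)

    return [
        {
            "cohortPhenotypes": cohort,
            "diseaseFromSource": disease,
            "diseaseFromSourceId": medgen_by_disease.get(disease),
            "diseaseFromSourceMappedId": efo,
        }
        for efo, disease in first_disease_for_efo.items()
    ]


mappings = {
    "Disease A": ["EFO 1"],
    "Disease B": ["EFO 1"],
    "Disease C": ["EFO 1"],
    "Disease D": ["EFO 2", "EFO 3"],
    "Disease E": [],  # no mapping, so it yields no evidence string of its own
}
medgen = {name: name.replace("Disease", "MedGen") for name in mappings}
evidence = explode_by_efo(mappings, medgen)
```

With these inputs the sketch yields three stubs, (Disease A, EFO 1), (Disease D, EFO 2) and (Disease D, EFO 3), each carrying all five disease names in `cohortPhenotypes`.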

tskir commented 3 years ago

Decisions made on the call

Allele origins
- Report the original allele origins (will need to remove enum)
- Stored in a list (already so in the schema)
- Everything which is not somatic is considered germline
- Somatic and germline are always separated into different evidence strings
Literature evidence
- What is required are only references for associations between disease and variant
Phenotype attributes
- cohortPhenotypes should include all diseases from the RCV (in the example, Disease A through E)
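
The allele origin decisions above could be sketched in Python roughly as follows. The function name `split_by_allele_origin` is hypothetical, and the datasourceId/datatypeId values are taken from the examples earlier in this thread. Origin values other than "germline" and "somatic" in the usage note are illustrative only.

```python
from typing import List


def split_by_allele_origin(allele_origins: List[str]) -> List[dict]:
    """Split a record's allele origins into germline and somatic evidence stubs.

    Original values are kept as-is; anything that is not "somatic" is
    grouped with germline, and the two groups never share an evidence string.
    """
    somatic = [origin for origin in allele_origins if origin == "somatic"]
    germline = [origin for origin in allele_origins if origin != "somatic"]

    stubs = []
    if germline:
        stubs.append({
            "alleleOrigins": germline,
            "datasourceId": "eva",
            "datatypeId": "genetic_association",
        })
    if somatic:
        stubs.append({
            "alleleOrigins": somatic,
            "datasourceId": "eva_somatic",
            "datatypeId": "somatic_mutation",
        })
    return stubs
```

For example, `split_by_allele_origin(["de novo", "somatic"])` returns two stubs: one `eva` stub whose `alleleOrigins` is `["de novo"]`, and one `eva_somatic` stub whose `alleleOrigins` is `["somatic"]`.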

tskir commented 3 years ago

@DSuveges I'm almost done implementing the changes above, but in the meantime, one additional question. Currently, we get variant coordinates from ClinVar, run our VEP mapping pipeline, and from it we get the values to populate targetFromSourceId and variantFunctionalConsequenceId. Would you like us to continue doing that? Or should those fields also be populated exclusively from ClinVar itself?

DSuveges commented 3 years ago

Hi @tskir, the current mapping pipeline was well thought out, let's keep using that.
