clingen-data-model / clinvar-ingest

Implement Trait and TraitMapping #38

Closed theferrit32 closed 10 months ago

theferrit32 commented 11 months ago

Trait:

Trait Set:

Trait Mapping:

theferrit32 commented 11 months ago

Example coming from original-clinvar-variation-2.xml

VariationArchive.InterpretedRecord.TraitMappingList

<TraitMappingList>
  <TraitMapping
      ClinicalAssertionID="20155"
      TraitType="Disease"
      MappingType="Name"
      MappingValue="SPASTIC PARAPLEGIA 48, AUTOSOMAL RECESSIVE"
      MappingRef="Preferred">
    <MedGen CUI="C3150901" Name="Hereditary spastic paraplegia 48"/>
  </TraitMapping>
  <TraitMapping
      ClinicalAssertionID="2865972"
      TraitType="Disease"
      MappingType="XRef"
      MappingValue="613647"
      MappingRef="OMIM">
    <MedGen CUI="C3150901" Name="Hereditary spastic paraplegia 48"/>
  </TraitMapping>
</TraitMappingList>

VariationArchive.InterpretedRecord.Interpretations.Interpretation.ConditionList

<ConditionList>
  <TraitSet ID="2" Type="Disease" ContributesToAggregateClinsig="true">
    <Trait ID="9580" Type="Disease">
      <Name>
        <ElementValue Type="Preferred">Hereditary spastic paraplegia 48</ElementValue>
        <XRef ID="MONDO:0013342" DB="MONDO"/>
      </Name>
      <Name>
        <ElementValue Type="Alternate">Spastic paraplegia 48</ElementValue>
      </Name>
      <Name>
        <ElementValue Type="Alternate">Spastic paraplegia 48, autosomal recessive</ElementValue>
        <XRef ID="Spastic+paraplegia+48%2C+autosomal+recessive/9323" DB="Genetic Alliance"/>
      </Name>
      <Symbol>
        <ElementValue Type="Alternate">SPG48</ElementValue>
        <XRef Type="MIM" ID="613647" DB="OMIM"/>
      </Symbol>
      <XRef ID="306511" DB="Orphanet"/>
      <XRef ID="C3150901" DB="MedGen"/>
      <XRef ID="MONDO:0013342" DB="MONDO"/>
      <XRef Type="MIM" ID="613647" DB="OMIM"/>
    </Trait>
  </TraitSet>
</ConditionList>

ClinicalAssertions

<ClinicalAssertion ID="20155" SubmissionDate="2017-01-26" 
    DateLastUpdated="2017-01-30" DateCreated="2013-04-04">
        <ClinVarSubmissionID localKey="613653.0001_SPASTIC PARAPLEGIA 48, AUTOSOMAL RECESSIVE" title="AP5Z1, 4-BP DEL/22-BP INS, NT80_SPASTIC PARAPLEGIA 48, AUTOSOMAL RECESSIVE"/>
        <ClinVarAccession Accession="SCV000020155" DateUpdated="2017-01-30" DateCreated="2013-04-04" Type="SCV" Version="3" SubmitterName="OMIM" OrgID="3" OrganizationCategory="resource"/>
        <RecordStatus>current</RecordStatus>
        <ReviewStatus>no assertion criteria provided</ReviewStatus>
        <Interpretation DateLastEvaluated="2010-06-29">
          <Description>Pathogenic</Description>
        </Interpretation>
        <Assertion>variation to disease</Assertion>
        <ObservedInList>
          …
        </ObservedInList>
        <SimpleAllele>
          …
        </SimpleAllele>
        <TraitSet Type="Disease">
          <Trait Type="Disease">
            <Name>
              <ElementValue Type="Preferred">SPASTIC PARAPLEGIA 48, AUTOSOMAL RECESSIVE</ElementValue>
            </Name>
          </Trait>
        </TraitSet>
      </ClinicalAssertion>

bpblanken commented 11 months ago

Is this something I could do?

theferrit32 commented 11 months ago

@bpblanken I've started working on this

theferrit32 commented 11 months ago

After discussion with @larrybabb, we will start by loading just trait, trait_set, and trait_mapping. When loading clinical_assertion_trait/trait_set, load all the fields except the normalized/mapped trait_set_id and trait_id.

For rcv_accession, instead of computing the trait set ID as DSP was doing, we will use the TraitSetID added by ClinVar. If no TraitSetID is in the XML, leave the field blank for now.

theferrit32 commented 11 months ago

For trait lists inside trait sets, use canonicaljson to ensure the order is sorted consistently so the generated IDs for each release are the same.
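
A minimal sketch of why the ordering matters. It uses the standard library's json.dumps with sorted keys and compact separators as a stand-in for canonicaljson, and the SHA-256 digest is only a hypothetical ID scheme (the thread does not say how IDs are derived). The names canonical_bytes, sorted_traits, and trait_set_digest are made up for illustration.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    # Stand-in for canonicaljson: sorted keys, no insignificant whitespace.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sorted_traits(traits: list) -> list:
    # Order traits by their canonical encoding so input order is irrelevant.
    return sorted(traits, key=canonical_bytes)


def trait_set_digest(trait_set: dict) -> str:
    # Hypothetical ID scheme: SHA-256 over the canonical form of the trait set.
    normalized = dict(trait_set, traits=sorted_traits(trait_set.get("traits", [])))
    return hashlib.sha256(canonical_bytes(normalized)).hexdigest()


# Same content, different key order and trait order.
a = {"type": "Disease", "traits": [{"id": "9580"}, {"id": "14121"}]}
b = {"traits": [{"id": "14121"}, {"id": "9580"}], "type": "Disease"}
same = trait_set_digest(a) == trait_set_digest(b)
```

Sorting the trait list before encoding is what keeps the digest stable when ClinVar emits the same traits in a different order between releases.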

theferrit32 commented 11 months ago

For Trait xrefs, there are optional fields ref_field and ref_field_element. In addition to XRef, Traits have multiple other types of internal elements, like Name, Symbol, each of which can also have XRefs. To differentiate these, the source element type and the differentiating value are provided.
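
A rough sketch of how the source element could be recorded, assuming ref_field is the lowercased element name and ref_field_element is the ElementValue text. The XML is a trimmed copy of the trait 9580 example above; xref_record and trait_xrefs are hypothetical helper names, not functions from the repository.

```python
import xml.etree.ElementTree as ET

TRAIT_XML = """
<Trait ID="9580" Type="Disease">
  <Name>
    <ElementValue Type="Preferred">Hereditary spastic paraplegia 48</ElementValue>
    <XRef ID="MONDO:0013342" DB="MONDO"/>
  </Name>
  <Symbol>
    <ElementValue Type="Alternate">SPG48</ElementValue>
    <XRef Type="MIM" ID="613647" DB="OMIM"/>
  </Symbol>
  <XRef ID="C3150901" DB="MedGen"/>
</Trait>
"""


def xref_record(xref, ref_field=None, ref_field_element=None):
    # Missing XML attributes (e.g. Type) come back as None.
    return {
        "db": xref.get("DB"),
        "id": xref.get("ID"),
        "type": xref.get("Type"),
        "ref_field": ref_field,
        "ref_field_element": ref_field_element,
    }


def trait_xrefs(trait):
    records = []
    # XRefs nested in Name/Symbol carry the element name and its value.
    for tag in ("Name", "Symbol"):
        for node in trait.findall(tag):
            value = node.findtext("ElementValue")
            for xref in node.findall("XRef"):
                records.append(xref_record(xref, tag.lower(), value))
    # XRefs that are direct children of Trait have no source element.
    for xref in trait.findall("XRef"):
        records.append(xref_record(xref))
    return records


xrefs = trait_xrefs(ET.fromstring(TRAIT_XML))
```

With this shape, the same database ID can appear once per location without the entries colliding, because ref_field and ref_field_element distinguish them.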

theferrit32 commented 11 months ago

In Trait, each AttributeSet can have 0..n XRef. The ref_field should be constructed from the AttributeSet.Attribute.Type.

e.g. AttributeSet.Attribute.Type="GARD id" -> ref_field="gard_id"

Then the ref_field_element depends on the type.

see: https://github.com/DataBiosphere/clinvar-ingest/blob/6ea03d1a334af6637992de9294c89b9327d721ac/transformation/src/main/scala/org/broadinstitute/monster/clinvar/parsers/Interpretation.scala#L136-L188
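
As an illustration of the mapping above, here is a small sketch that derives ref_field from the Attribute Type by lowercasing it and replacing whitespace with underscores. That normalization rule is an assumption inferred from the "GARD id" example; the linked Scala code may handle edge cases differently. attribute_ref_field and attribute_set_xrefs are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

ATTRIBUTE_SET_XML = """
<AttributeSet>
  <Attribute Type="GARD id" integerValue="6426"/>
  <XRef ID="6426" DB="Office of Rare Diseases"/>
</AttributeSet>
"""


def attribute_ref_field(attribute_type: str) -> str:
    # Assumed rule: "GARD id" -> "gard_id", "public definition" -> "public_definition".
    return re.sub(r"\s+", "_", attribute_type.strip().lower())


def attribute_set_xrefs(attribute_set):
    # An AttributeSet has one Attribute and 0..n XRef children.
    attribute = attribute_set.find("Attribute")
    ref_field = attribute_ref_field(attribute.get("Type"))
    return [
        {
            "db": xref.get("DB"),
            "id": xref.get("ID"),
            "type": xref.get("Type"),
            "ref_field": ref_field,
        }
        for xref in attribute_set.findall("XRef")
    ]


gard_xrefs = attribute_set_xrefs(ET.fromstring(ATTRIBUTE_SET_XML))
```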

VariationArchive.InterpretedRecord.Interpretations.Interpretation.ConditionList

        <ConditionList>
          <TraitSet ID="43" Type="Disease" ContributesToAggregateClinsig="true">
            <Trait ID="14121" Type="Disease">
              <Name>
                <ElementValue Type="Alternate">Farber's lipogranulomatosis</ElementValue>
                <XRef ID="79935000" DB="SNOMED CT"/>
              </Name>
              <Name>
                <ElementValue Type="Alternate">Farber's disease</ElementValue>
              </Name>
              <Name>
                <ElementValue Type="Alternate">Farber disease</ElementValue>
                <XRef ID="333" DB="Orphanet"/>
              </Name>
              <Name>
                <ElementValue Type="Preferred">Farber lipogranulomatosis</ElementValue>
                <XRef ID="MONDO:0009218" DB="MONDO"/>
                <XRef ID="79935000" DB="SNOMED CT"/>
              </Name>
              <Name>
                <ElementValue Type="Alternate">Ceramidase deficiency</ElementValue>
              </Name>
              <Name>
                <ElementValue Type="Alternate">Acid ceramidase deficiency</ElementValue>
              </Name>
              <Name>
                <ElementValue Type="Alternate">AC deficiency</ElementValue>
              </Name>
              <Name>
                <ElementValue Type="Alternate">N-Laurylsphingosine deacylase deficiency</ElementValue>
              </Name>
              <Symbol>
                <ElementValue Type="Preferred">FRBRL</ElementValue>
                <XRef Type="MIM" ID="228000" DB="OMIM"/>
              </Symbol>
              <AttributeSet>
                <Attribute Type="GARD id" integerValue="6426"/>
                <XRef ID="6426" DB="Office of Rare Diseases"/>
              </AttributeSet>
              <AttributeSet>
                <Attribute Type="public definition">The spectrum of ASAH1-related disorders ranges from Farber disease (FD) to spinal muscular atrophy with progressive myoclonic epilepsy (SMA-PME). Classic FD is characterized by onset in the first weeks of life of painful, progressive deformity of the major joints; palpable subcutaneous nodules of joints and mechanical pressure points; and a hoarse cry resulting from granulomas of the larynx and epiglottis. Life expectancy is usually less than two years. In the other less common types of FD, onset, severity, and primary manifestations vary. SMA-PME is characterized by early-childhood-onset progressive lower motor neuron disease manifest typically between ages three and seven years as proximal lower-extremity weakness, followed by progressive myoclonic and atonic seizures, tremulousness/tremor, and sensorineural hearing loss. Myoclonic epilepsy typically begins in late childhood after the onset of weakness and can include jerking of the upper limbs, action myoclonus, myoclonic status, and eyelid myoclonus. Other findings include generalized tremor, and cognitive decline. The time from disease onset to death from respiratory complications is usually five to 15 years.</Attribute>
                <XRef ID="NBK488189" DB="GeneReviews"/>
              </AttributeSet>
              <Citation Type="review" Abbrev="GeneReviews">
                <ID Source="PubMed">29595935</ID>
                <ID Source="BookShelf">NBK488189</ID>
              </Citation>
              <XRef ID="333" DB="Orphanet"/>
              <XRef ID="C0268255" DB="MedGen"/>
              <XRef ID="MONDO:0009218" DB="MONDO"/>
              <XRef Type="MIM" ID="228000" DB="OMIM"/>
            </Trait>
          </TraitSet>
        </ConditionList>

theferrit32 commented 11 months ago

For trait xrefs coming from attribute public definition, it looks like none of the attribute values get picked up by the DSP clinvar-ingest.

query:

SELECT 
  id AS trait_id, 
  xref_db,
  xref_id,
  xref_type,
  xref_ref_field, 
  xref_ref_field_element
FROM
(SELECT 
  id,
  xref,
  JSON_QUERY(xref, "$.db") AS xref_db,
  JSON_QUERY(xref, "$.id") AS xref_id,
  JSON_QUERY(xref, "$.type") AS xref_type,
  JSON_QUERY(xref, "$.ref_field") AS xref_ref_field,
  JSON_QUERY(xref, "$.ref_field_element") AS xref_ref_field_element
FROM `clingen-stage.clinvar_2023_11_04_v1_6_61.trait`
CROSS JOIN UNNEST(xrefs) AS xref)
WHERE xref_ref_field = "\"public_definition\""
  -- AND xref_ref_field_element IS NOT NULL

The query above returns results, but with the last line uncommented (to exclude null xref_ref_field_element) it returns none. From reading the Scala code, it looks like the intent was to include the public definition node contents from the XML, though those can be very long, generally a paragraph.

https://github.com/DataBiosphere/clinvar-ingest/blob/6ea03d1a334af6637992de9294c89b9327d721ac/transformation/src/main/scala/org/broadinstitute/monster/clinvar/parsers/Interpretation.scala#L136-L141

theferrit32 commented 11 months ago

Never mind, they intentionally left the public_definition value out of the xref ref_field_element, and likewise left it out of any xref where only one is expected. So it's only filled in on keywords. https://github.com/DataBiosphere/clinvar-ingest/blob/6ea03d1a334af6637992de9294c89b9327d721ac/transformation/src/main/scala/org/broadinstitute/monster/clinvar/parsers/Interpretation.scala#L139
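
A sketch of that rule as I read it, assuming "keyword" is the only attribute type whose text is kept as ref_field_element. The set name MULTI_VALUED_TYPES, the function name, and the placeholder definition text are illustrative, not taken from either codebase.

```python
import xml.etree.ElementTree as ET

# Assumption: "keyword" is the only attribute type whose text is kept.
MULTI_VALUED_TYPES = {"keyword"}


def attribute_ref_field_element(attribute):
    # Keep the attribute text only for types that can repeat within a trait.
    if attribute.get("Type") in MULTI_VALUED_TYPES:
        text = (attribute.text or "").strip()
        return text or None
    return None


keyword = ET.fromstring('<Attribute Type="keyword">Neoplasm</Attribute>')
# Placeholder text standing in for a long public definition paragraph.
definition = ET.fromstring(
    '<Attribute Type="public definition">Long paragraph.</Attribute>'
)
```

Here attribute_ref_field_element(keyword) yields the keyword text, while the public definition yields None, which matches the null xref_ref_field_element values seen in the query results.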

theferrit32 commented 11 months ago

gard id example

<AttributeSet>
  <Attribute Type="GARD id" integerValue="10511"/>
  <XRef ID="10511" DB="Office of Rare Diseases"/>
</AttributeSet>

multiple keyword example (trait set 3976)

              <AttributeSet>
                <Attribute Type="keyword">Hereditary cancer syndrome</Attribute>
              </AttributeSet>
              <AttributeSet>
                <Attribute Type="keyword">Neoplasm</Attribute>
              </AttributeSet>

mode of inheritance example (trait set 88)

             <AttributeSet>
                <Attribute Type="mode of inheritance">autosomal recessive or autosomal dominant</Attribute>
                <XRef ID="GTR000502548" DB="Genetic Testing Registry (GTR)"/>
              </AttributeSet>

theferrit32 commented 10 months ago

Some XRefs appear multiple times in a trait record in the XML, in different locations. All occurrences will be included, as they are in the DSP clinvar-ingest code. Below is an example of MONDO ID MONDO:0021001 appearing both at the top level of the trait (ref_field=null) and inside the preferred Name element of the trait.

One discrepancy between our list below and DSP's is that DSP removes the MedGen ID from the xrefs when extracting it into a top-level Trait field. (TODO)

DSP xrefs for trait id = '9582'

SELECT ARRAY_LENGTH(xrefs) as ct
FROM `clingen-stage.clinvar_2023_10_07_v1_6_61.trait`
WHERE id = '9582'

(count = 29)

[
  {
    "db": "MONDO",
    "id": "MONDO:0021001"
  },
  {
    "db": "OMIM",
    "id": "235200",
    "type": "MIM"
  },
  {
    "db": "Orphanet",
    "id": "139498"
  },
  {
    "db": "Orphanet",
    "id": "465508"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000028914",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000260619",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000264968",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000271417",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000332464",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000500300",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000500638",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000501267",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000501371",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000507663",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000508786",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000508970",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000509340",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000521586",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000528695",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000531271",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000551894",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000558542",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000558915",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000560323",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Genetic Testing Registry (GTR)",
    "id": "GTR000560567",
    "ref_field": "disease_mechanism"
  },
  {
    "db": "Office of Rare Diseases",
    "id": "10417",
    "ref_field": "gard_id"
  },
  {
    "db": "MONDO",
    "id": "MONDO:0021001",
    "ref_field": "name"
  },
  {
    "db": "GeneReviews",
    "id": "NBK1440",
    "ref_field": "public_definition"
  },
  {
    "db": "OMIM",
    "id": "235200",
    "type": "MIM",
    "ref_field": "symbol"
  }
]

XRefs from clinvar-ingest for trait 9582:

(count = 30)

[
    {
      "db": "MONDO",
      "id": "MONDO:0021001",
      "type": null,
      "ref_field": "name",
      "ref_field_element": "Hemochromatosis type 1"
    },
    {
      "db": "OMIM",
      "id": "235200",
      "type": "MIM",
      "ref_field": "symbol",
      "ref_field_element": "HFE1"
    },
    {
      "db": "GeneReviews",
      "id": "NBK1440",
      "type": null,
      "ref_field": "public_definition",
      "ref_field_element": null
    },
    {
      "db": "Office of Rare Diseases",
      "id": "10417",
      "type": null,
      "ref_field": "gard_id",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000028914",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000271417",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000500300",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000501267",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000531271",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000551894",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000558542",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000560323",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000558915",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000501371",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000507663",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000508970",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000509340",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000560567",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000260619",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000264968",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000332464",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000500638",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000508786",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000521586",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Genetic Testing Registry (GTR)",
      "id": "GTR000528695",
      "type": null,
      "ref_field": "disease_mechanism",
      "ref_field_element": null
    },
    {
      "db": "Orphanet",
      "id": "139498",
      "type": null,
      "ref_field": null,
      "ref_field_element": null
    },
    {
      "db": "Orphanet",
      "id": "465508",
      "type": null,
      "ref_field": null,
      "ref_field_element": null
    },
    {
      "db": "MedGen",
      "id": "C3469186",
      "type": null,
      "ref_field": null,
      "ref_field_element": null
    },
    {
      "db": "MONDO",
      "id": "MONDO:0021001",
      "type": null,
      "ref_field": null,
      "ref_field_element": null
    },
    {
      "db": "OMIM",
      "id": "235200",
      "type": "MIM",
      "ref_field": null,
      "ref_field_element": null
    }
  ]
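
One way to check the two lists against each other is sketched below. It compares on db, id, type, and ref_field only, treating a missing key and an explicit null as equal; ref_field_element is left out of the comparison because the DSP listing above does not show it. The sample entries are a small subset copied from the two listings, and diff_xrefs is a hypothetical helper.

```python
COMPARE_KEYS = ("db", "id", "type", "ref_field")


def xref_key(xref):
    # dict.get returns None for absent keys, so missing and null compare equal.
    return tuple(xref.get(key) for key in COMPARE_KEYS)


def diff_xrefs(ours, theirs):
    ours_keys = {xref_key(x) for x in ours}
    theirs_keys = {xref_key(x) for x in theirs}
    # Entries only in our output, and entries only in the other output.
    return ours_keys - theirs_keys, theirs_keys - ours_keys


ours = [
    {"db": "MONDO", "id": "MONDO:0021001", "type": None,
     "ref_field": "name", "ref_field_element": "Hemochromatosis type 1"},
    {"db": "MedGen", "id": "C3469186", "type": None,
     "ref_field": None, "ref_field_element": None},
    {"db": "MONDO", "id": "MONDO:0021001", "type": None,
     "ref_field": None, "ref_field_element": None},
]
dsp = [
    {"db": "MONDO", "id": "MONDO:0021001"},
    {"db": "MONDO", "id": "MONDO:0021001", "ref_field": "name"},
]

only_ours, only_dsp = diff_xrefs(ours, dsp)
```

On this subset only the MedGen entry is unique to our output, which is consistent with the count difference of 30 versus 29.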
theferrit32 commented 10 months ago

Traits can have multiple GARD IDs:

          <TraitSet ID="1347" Type="Disease" ContributesToAggregateClinsig="true">
            <Trait ID="3510" Type="Disease">
              <Name>
                <ElementValue Type="Alternate">Polydactyly, preaxial II</ElementValue>
              </Name>
              <Name>
                <ElementValue Type="Preferred">Polydactyly of a triphalangeal thumb</ElementValue>
                <XRef ID="MONDO:0008270" DB="MONDO"/>
              </Name>
              <Name>
                <ElementValue Type="Alternate">POLYDACTYLY OF TRIPHALANGEAL THUMB</ElementValue>
                <XRef Type="MIM" ID="174500" DB="OMIM"/>
              </Name>
              <Name>
                <ElementValue Type="Alternate">TRIPHALANGEAL THUMB-POLYDACTYLY SYNDROME</ElementValue>
                <XRef Type="MIM" ID="174500" DB="OMIM"/>
              </Name>
              <Symbol>
                <ElementValue Type="Alternate">PPD2</ElementValue>
                <XRef Type="MIM" ID="174500" DB="OMIM"/>
              </Symbol>
              <AttributeSet>
                <Attribute Type="GARD id" integerValue="4260"/>
                <XRef ID="4260" DB="Office of Rare Diseases"/>
              </AttributeSet>
              <AttributeSet>
                <Attribute Type="GARD id" integerValue="5289"/>
                <XRef ID="5289" DB="Office of Rare Diseases"/>
              </AttributeSet>
              <XRef ID="2439" DB="Orphanet"/>
              <XRef ID="2950" DB="Orphanet"/>
              <XRef ID="93336" DB="Orphanet"/>
              <XRef ID="C1868114" DB="MedGen"/>
              <XRef ID="MONDO:0008270" DB="MONDO"/>
              <XRef Type="MIM" ID="174500" DB="OMIM"/>
            </Trait>
          </TraitSet>

theferrit32 commented 10 months ago

The output field attribute_content should hold the remaining Attributes in AttributeSet, i.e. those not explicitly popped into other fields.

It is a STRING REPEATED field, so each attribute should be a separate array entry and be JSON-encoded.
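
A minimal sketch of that output shape, assuming each attribute has already been parsed into a dict. The dict keys (type, value, integer_value) and the choice of which types count as already extracted are hypothetical; the thread does not list them.

```python
import json


def attribute_content(attributes, extracted_types):
    # One JSON-encoded string per remaining attribute (STRING REPEATED).
    return [
        json.dumps(attribute, sort_keys=True)
        for attribute in attributes
        if attribute.get("type") not in extracted_types
    ]


attributes = [
    {"type": "GARD id", "integer_value": 6426},
    {"type": "keyword", "value": "Neoplasm"},
    {"type": "mode of inheritance",
     "value": "autosomal recessive or autosomal dominant"},
]

# Hypothetical: pretend only "GARD id" was popped into its own field.
content = attribute_content(attributes, extracted_types={"GARD id"})
```

Encoding each attribute separately, rather than dumping the whole list as one JSON string, keeps the column a repeated string where every entry can be parsed on its own.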

theferrit32 commented 10 months ago

Closed by #48