clingen-data-model / clinvar-streams

Extract HGVS expressions and related attributes from the Variation.content field #58

Open larrybabb opened 2 years ago

larrybabb commented 2 years ago

This is the first of several Class.content fields that originate from the DSP ClinVar ingest stream and will need to be parsed and stored formally in the final transformed messages.

This first element is the array of HGVS expressions embedded in the serialized JSON of the Variation.content field.

Each Variation object will have zero, one, or more HGVS elements in the stringified JSON content attribute.

The JSON path $.HGVSlist.HGVS may be either an array (when more than one expression exists) or a single element (when only one exists). I believe the $.HGVSlist.HGVS node is absent when no HGVS expressions exist for a variation, but it may instead be an empty array or an empty single node (can't recall right now).
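Since the exact "no HGVS" representation is uncertain, a tolerant normalization step may help. The following is a minimal Python sketch, not the project's implementation; the function name `hgvs_nodes` is hypothetical, and it assumes the content field is a JSON string whose top level is an object. It treats a missing node, a null, an empty array, and an empty object all as "no HGVS expressions".

```python
import json
from typing import Any


def hgvs_nodes(content: str) -> list[dict[str, Any]]:
    """Return the HGVS nodes from a stringified Variation.content as a list.

    Hypothetical helper: handles $.HGVSlist.HGVS being absent, null,
    a single object, or an array of objects.
    """
    doc = json.loads(content) if content else {}
    hgvs_list = doc.get("HGVSlist") if isinstance(doc, dict) else None
    if not isinstance(hgvs_list, dict):
        return []  # no HGVSlist at all (or it is null)

    raw = hgvs_list.get("HGVS")
    if raw is None:
        return []  # HGVSlist present but no HGVS node

    # A single element is wrapped so callers always iterate over a list.
    candidates = raw if isinstance(raw, list) else [raw]

    # Drop empty or non-object entries (the "empty single node" case).
    return [node for node in candidates if isinstance(node, dict) and node]
```

With this shape, downstream code can loop over the result without caring which of the cases above the stream actually produced.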

Each HGVS node will need to be parsed into a structure with the following shape:

hgvs.assembly - $['@Assembly']
hgvs.type - $['@Type']
hgvs.nucleotideExpression - $['NucleotideExpression']['Expression']['$']
hgvs.nucleotideExpression.isManeSelect - $['NucleotideExpression']['@MANESelect'] (boolean, TRUE/FALSE)
hgvs.proteinExpression - $['ProteinExpression']['Expression']['$']
hgvs.molecularConsequence.db - $['MolecularConsequence']['@DB']
hgvs.molecularConsequence.id - $['MolecularConsequence']['@ID']
hgvs.molecularConsequence.type - $['MolecularConsequence']['@Type']

Some general patterns that may be informational as to how these fields are typically populated...

We will need to finalize the destination structure for this data in our GeneGraph model. For general reference, these fields will ultimately end up in the VariationDescriptor class associated with the core VCV and SCV statements being transformed.

larrybabb commented 2 years ago

NOTE: Eventually we will be extracting ALL the data from the various Class.content fields. For the initial MVP of standardizing ClinVar into GeneGraph, we will be identifying fields from the ClinicalAssertionObservation.content JSON and possibly from GeneAssociation.content.

These are yet to be written up.

theferrit32 commented 1 year ago

See genegraph.transform.clinvar.variation for the current way this is done.

https://github.com/clingen-data-model/genegraph/blob/6c3a3a051af7d85b014f4f495549356193aea25d/src/genegraph/transform/clinvar/variation.clj#L47-L157