clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

Establish a baseline Variation to VRS (allele, CN or Text) output #778

Closed · larrybabb closed this issue 10 months ago

larrybabb commented 1 year ago

This task involves establishing a new process that will take a gzip-compressed JSON file of variation_identity records from Google Cloud Storage (GCS) for a given release (or a subset of a release) and convert each record into a baseline VRS variation by calling the normalization API.

The GCS files will be provided in the bucket gs://clinvar-gk-pilot, starting with the data from ClinVar release 2023-04-10 (in the clingen-stage project).

The schema for the variation_identity input JSON can be found in the BigQuery clingen-stage project under the dataset clinvar_2023_04_10_v1_6_58.
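For reference, here is a minimal sketch of streaming the records out of the bucket, assuming the file is gzip-compressed, newline-delimited JSON and that the google-cloud-storage client is used; the object name shown is hypothetical:

```python
# Minimal sketch: stream variation_identity records out of the GCS bucket.
# Assumes gzip-compressed, newline-delimited JSON; the object path is hypothetical.
import gzip
import json
from google.cloud import storage

def read_variation_identity_records(bucket_name="clinvar-gk-pilot",
                                    object_name="2023-04-10/variation_identity.json.gz"):
    """Yield one variation_identity record (dict) per line of the gzipped NDJSON blob."""
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    with blob.open("rb") as raw, gzip.open(raw, mode="rt", encoding="utf-8") as text:
        for line in text:
            if line.strip():
                yield json.loads(line)
```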

The variation_identity record processing rules, in order of precedence, are as follows (a rough classification sketch follows the list):

  1. Canonical SPDI (Allele): if there is a canonical_spdi value use it to call the to_canonical_variation API. Extract the canonical_variation.variation object which will represent the 'Allele' for the input variation record.
  2. Genotype/Haplotype (Text): if the subclass_type is either Genotype or Haplotype then use the id (clinvar variation id) as the input to get a VRS Text object for this record.
  3. Genome NCBI36 only (Text): if at least one of the hgvs.assembly or seq.assembly has the value 'NCBI36' and neither one has 'GRCh37' or 'GRCh38' then use the id (clinvar variation id) as the input to get a VRS Text object for this record.
  4. No HGVS or SeqLoc info (Text): if the hgvs.assembly, hgvs.nucleotide, seq.assembly and seq.accession are all null then use the id (clinvar variation id) as the input to get a VRS Text object for this record.
  5. Invalid/Unsupported HGVS (Text): If the hgvs.nucleotide is not null and it does NOT match one of the following REGEXPs then use the id (clinvar variation id) as the input to get a VRS Text object for this record.
    a. snvs: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.[0-9]+[ACTG]\>[ACTG]+$'
    b. same as ref: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.[0-9]+[ACTG]?\=$'
    c. single residue dup/del/delins: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.[0-9]+(dup|del|delins)[ACTG]*$'
    d. precise range dup/del/delins/ins: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.[0-9]+\_[0-9]+(dup|del|delins|ins)[ACTG]*$'
    e. precise inner/outer range dup or del or delins or ins: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.\([0-9]+\_[0-9]+\)\_\([0-9]+\_[0-9]+\)(dup|del|delins|ins)[ACTG]*$'
    f. imprecise outer range dup or del or delins or ins: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.\(\?\_[0-9]+\)\_\([0-9]+\_\?\)(dup|del|delins|ins)[ACTG]*$'
    g. imprecise inner range dup or del or delins or ins: '(CM|N[CTW]\_)[0-9]+\.[0-9]+\:[gm]\.\([0-9]+\_\?\)\_\(\?\_[0-9]+\)(dup|del|delins|ins)[ACTG]*$'
  6. Names ending in 'x[0-9]+' (CopyNumberCount): When the absolute_copies value is not null, use the hgvs.nucleotide value, if it is available, to call the hgvs_to_copy_number_count API. If the hgvs.nucleotide value is null, then use the seq.derived_hgvs value if it is available.
  7. CNVs with min,max counts (Text): If min_copies and/or max_copies is not null, then use the id to create a Text variation.
  8. Remaining copy loss/gain, dels/dups (CopyNumberChange): If the variation_type is 'Deletion', 'Duplication', 'copy number loss', or 'copy number gain', then use the hgvs.nucleotide expression to call the hgvs_to_copy_number_change API. If the hgvs.nucleotide value is null, then use seq.derived_hgvs if it is not null.
  9. Remaining supported genomic HGVS (Allele): If any remaining records have a value in the hgvs.nucleotide field, then use it to call the to_vrs API.
  10. Insufficient information (Text): all remaining records should use the id value to create a Text variation.
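Here is a rough classification sketch of the precedence above. Field names follow the variation_identity schema described in this ticket; the code is illustrative, not the actual pipeline implementation:

```python
# Rough sketch of the precedence rules above. Returns the target VRS type and the
# rule number that fired. Illustrative only; field access assumes the variation_identity
# schema described in this issue.
import re

SUPPORTED_HGVS = [re.compile(p) for p in (
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.[0-9]+[ACTG]>[ACTG]+$',                      # a. snv
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.[0-9]+[ACTG]?=$',                            # b. same as ref
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.[0-9]+(dup|del|delins)[ACTG]*$',             # c. single residue
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.[0-9]+_[0-9]+(dup|del|delins|ins)[ACTG]*$',  # d. precise range
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.\([0-9]+_[0-9]+\)_\([0-9]+_[0-9]+\)(dup|del|delins|ins)[ACTG]*$',  # e
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.\(\?_[0-9]+\)_\([0-9]+_\?\)(dup|del|delins|ins)[ACTG]*$',          # f
    r'(CM|N[CTW]_)[0-9]+\.[0-9]+:[gm]\.\([0-9]+_\?\)_\(\?_[0-9]+\)(dup|del|delins|ins)[ACTG]*$',          # g
)]

def classify(rec: dict) -> tuple:
    """Return (vrs_type, policy) for a variation_identity record, in precedence order."""
    hgvs = rec.get("hgvs") or {}
    seq = rec.get("seq") or {}
    nucleotide = hgvs.get("nucleotide")
    assemblies = {hgvs.get("assembly"), seq.get("assembly")}

    if rec.get("canonical_spdi"):
        return "Allele", "1"
    if rec.get("subclass_type") in ("Genotype", "Haplotype"):
        return "Text", "2"
    if "NCBI36" in assemblies and not assemblies & {"GRCh37", "GRCh38"}:
        return "Text", "3"
    if not any([hgvs.get("assembly"), nucleotide, seq.get("assembly"), seq.get("accession")]):
        return "Text", "4"
    if nucleotide and not any(p.match(nucleotide) for p in SUPPORTED_HGVS):
        return "Text", "5"
    if rec.get("absolute_copies") is not None:
        return "CopyNumberCount", "6"
    if rec.get("min_copies") is not None or rec.get("max_copies") is not None:
        return "Text", "7"
    if rec.get("variation_type") in ("Deletion", "Duplication",
                                     "copy number loss", "copy number gain"):
        return "CopyNumberChange", "8"
    if nucleotide:
        return "Allele", "9"
    return "Text", "10"
```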

Separate the output objects by VRS variation type into separate files (i.e. Allele, Text, CopyNumberCount, CopyNumberChange) in JSON format and store them in the same GCS bucket in gzip-compressed form.
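One way this could look, assuming gzip-compressed newline-delimited JSON output and hypothetical output object names:

```python
# Sketch of splitting VRS output by variation type and writing each group back to the
# same bucket as gzip-compressed NDJSON. Output object names here are hypothetical.
import gzip
import json
from collections import defaultdict
from google.cloud import storage

def write_outputs_by_type(typed_variations, bucket_name="clinvar-gk-pilot",
                          prefix="2023-04-10/vrs"):
    """`typed_variations` yields (vrs_type, vrs_object) pairs, e.g. ("Allele", {...})."""
    grouped = defaultdict(list)
    for vrs_type, obj in typed_variations:
        grouped[vrs_type].append(obj)
    bucket = storage.Client().bucket(bucket_name)
    for vrs_type, objects in grouped.items():  # Allele, Text, CopyNumberCount, CopyNumberChange
        payload = "".join(json.dumps(o) + "\n" for o in objects)
        blob = bucket.blob(f"{prefix}/{vrs_type}.json.gz")
        blob.upload_from_string(gzip.compress(payload.encode("utf-8")),
                                content_type="application/gzip")
```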

larrybabb commented 1 year ago

@theferrit32 I think I'm going to apply some of the rules above to the upstream data and modify the variation_identity schema to capture it so that we can keep all the logic for applying these policies in one place. I'll keep you informed.

larrybabb commented 1 year ago

@theferrit32 I added a new field to variation_identity called seq.derived_hgvs, which produces the best representation of the seq.* fields as an HGVS expression for any "Deletion", "Duplication", or "copy number loss/gain" variation type. So for steps #6 and #8 above you do not have to concern yourself with doing that work. I updated the GCS file that contains the variation_identity records, and you can look in the clinvar_2023_04_10_v1... dataset in the BigQuery staging project to see the change to the schema, which includes the new derived_hgvs field in the seq struct. Let me know if you have any questions. I assume this will speed up your portion of the task.
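In other words, for steps #6 and #8 the HGVS input selection reduces to something like this (a sketch, using the field names above):

```python
# Prefer hgvs.nucleotide, fall back to the new seq.derived_hgvs field (steps 6 and 8).
def copy_number_hgvs_input(rec: dict):
    """Pick the HGVS expression to pass to the copy-number normalizer calls."""
    nucleotide = (rec.get("hgvs") or {}).get("nucleotide")
    return nucleotide or (rec.get("seq") or {}).get("derived_hgvs")
```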

larrybabb commented 1 year ago

@theferrit32 I added one more struct, vrs_xform_plan, at the end of the variation_identity record for each variant. It provides the type (Allele, CopyNumberCount, CopyNumberChange, Text), the inputs (an array of the fields within the variation_identity record that would be used in the normalizer API call to get the VRS object back), and the policy (the rule from the description on this ticket above that was used to determine how the particular variation record was bucketed).
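A hypothetical example of a record carrying the new struct (all values are made up for illustration):

```python
# Hypothetical shape of a variation_identity record with the new vrs_xform_plan struct;
# values here are illustrative, not taken from the actual release file.
example_record = {
    "id": "12345",                                 # ClinVar variation id (made up)
    "canonical_spdi": "NC_000001.11:100000:A:G",   # made-up value
    # ... other variation_identity fields ...
    "vrs_xform_plan": {
        "type": "Allele",              # Allele | CopyNumberCount | CopyNumberChange | Text
        "inputs": ["canonical_spdi"],  # fields fed to the normalizer call
        "policy": "1",                 # rule number from the list in the description
    },
}
```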

For Allele types use the translate_from API, for CopyNumberCount use hgvs_to_copy_number_count, and for CopyNumberChange use hgvs_to_copy_number_change (with the appropriate EFO code).
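A sketch of dispatching on the plan; the call names follow the comment above, but the signatures, the `normalizer` object, and the EFO code handling shown here are placeholders, not a confirmed client API:

```python
# Dispatch on vrs_xform_plan.type. `normalizer` stands in for whatever client wraps the
# variation normalization service; argument shapes below are placeholders.
def _get(rec, dotted):
    """Resolve a dotted field name like 'hgvs.nucleotide' against the record."""
    value = rec
    for part in dotted.split("."):
        value = (value or {}).get(part)
    return value

def to_vrs_object(rec: dict, normalizer, efo_code=None):
    plan = rec["vrs_xform_plan"]
    inputs = [_get(rec, f) for f in plan["inputs"]]
    if plan["type"] == "Allele":
        return normalizer.translate_from(*inputs)
    if plan["type"] == "CopyNumberCount":
        return normalizer.hgvs_to_copy_number_count(*inputs)
    if plan["type"] == "CopyNumberChange":
        # efo_code would be the appropriate EFO copy-change code for the record
        return normalizer.hgvs_to_copy_number_change(*inputs, copy_change=efo_code)
    return {"type": "Text", "definition": rec["id"]}  # Text fallback per the rules above
```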

Note: Ignore all Text variations once we start moving to the 2.0-alpha schema, since we will not be dealing with Text variations the same way.