Closed larrybabb closed 10 months ago
@theferrit32 I think I'm going to apply some of the rules above to the upstream data and modify the `variation_identity` schema to capture it, so that we can keep all the logic for applying these policies in one place. I'll keep you informed.
@theferrit32 I added a new field to the `variation_identity` schema called `seq.derived_hgvs`, which produces the best representation of the `seq.*` fields as an HGVS expression for any "Deletion", "Duplication", "copy number loss", or "copy number gain" variation type. So for steps #6 and #8 above you do not have to concern yourself with doing that work. I updated the GCS file that contains the `variation_identity` records, and you can go to the BigQuery staging project and look in the `clinvar_2023_04_10_v1...` dataset to see the change to the schema, which includes the new `derived_hgvs` field in the `seq` struct. Let me know if you have any questions. I assume this will speed up your portion of the task.
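The derivation itself isn't shown in this thread. As a hedged illustration only, building an HGVS-style deletion/duplication expression from `seq.*` fields might look like the sketch below; the field names `accession`, `start`, `stop`, and `variation_type`, and the type-to-suffix mapping, are assumptions for this sketch, not the pilot's actual schema or code.

```python
from typing import Optional

# Hypothetical sketch only: field names and the type-to-suffix mapping are
# assumptions based on the discussion above, not the real implementation.
SUFFIX_BY_TYPE = {
    "Deletion": "del",
    "copy number loss": "del",
    "Duplication": "dup",
    "copy number gain": "dup",
}

def derive_hgvs(seq: dict) -> Optional[str]:
    """Build e.g. 'NC_000001.11:g.1000_2000del' from seq.* fields, or None."""
    accession = seq.get("accession")
    start = seq.get("start")
    stop = seq.get("stop")
    suffix = SUFFIX_BY_TYPE.get(seq.get("variation_type"))
    if not (accession and start and stop and suffix):
        return None
    return f"{accession}:g.{start}_{stop}{suffix}"
```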
@theferrit32 I added one more struct, `vrs_xform_plan`, on the end of the `variation_identity` record for each variant. It provides:

- `type` (Allele, CopyNumberCount, CopyNumberChange, Text),
- `inputs` (an array of the fields within the `variation_identity` record that would be used in the normalizer API call to get the VRS object back), and
- `policy` (the rule from the description on this ticket above that was used to determine how the particular variation record was bucketed).
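For illustration, such a plan on a record might look like the following; every value here is invented for the example, not taken from the real dataset.

```python
# Invented example of what a vrs_xform_plan struct could contain. The three
# field names (type, inputs, policy) come from the comment above; the values
# are hypothetical placeholders.
record = {
    "id": "12345",  # hypothetical ClinVar variation id
    "vrs_xform_plan": {
        "type": "CopyNumberChange",
        "inputs": ["hgvs.nucleotide", "variation_type"],
        "policy": "deletion/duplication rule",  # placeholder policy label
    },
}
```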
- For `Allele` types, use the `translate_from` API.
- For `CopyNumberCount`, use `hgvs_to_copy_number_count`.
- For `CopyNumberChange`, use `hgvs_to_copy_number_change` (and the appropriate EFO code).
Note: ignore all `Text` variations once we start moving to the 2.0alpha schema, since we will not be dealing with `Text` variation the same way.
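Putting the routing above together, a dispatch on `vrs_xform_plan.type` could be sketched as below. Only the method names come from this thread; the client object, call signatures, and the EFO code mapping are illustrative assumptions.

```python
from typing import Any, Optional

# EFO codes for copy number change as commonly used with VRS tooling; treat
# this exact mapping as an assumption for the sketch.
EFO_BY_TYPE = {
    "copy number loss": "EFO:0030067",
    "copy number gain": "EFO:0030070",
}

def transform(record: dict, api: Any) -> Optional[Any]:
    """Route a variation_identity record to a normalizer call by plan type.

    `api` is a placeholder client: only the method names appear in the
    thread above, the signatures are invented for this sketch.
    """
    plan_type = record["vrs_xform_plan"]["type"]
    # Prefer the submitted HGVS expression, fall back to the derived one.
    hgvs = (record.get("hgvs") or {}).get("nucleotide") or \
           (record.get("seq") or {}).get("derived_hgvs")
    if plan_type == "Allele":
        return api.translate_from(hgvs, "hgvs")
    if plan_type == "CopyNumberCount":
        return api.hgvs_to_copy_number_count(hgvs, record.get("absolute_copies"))
    if plan_type == "CopyNumberChange":
        return api.hgvs_to_copy_number_change(
            hgvs, EFO_BY_TYPE.get(record.get("variation_type")))
    return None  # Text: to be ignored once the 2.0alpha schema is adopted
```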
This task involves establishing a new process that will take a gzip-compressed JSON file of `variation_identity` records from Google GCS for a given release (or subset of a release) and convert each one into a baseline VRS variation by calling the normalization API.

The GCS files will be provided from the folder `gs://clinvar-gk-pilot`, starting with the data from ClinVar release 2023-04-10 (in the `clingen-stage` project).

The schema for the `variation_identity` input JSON can be found in the BigQuery `clingen-stage` project under the dataset `clinvar_2023_04_10_v1_6_58`.

The `variation_identity` record processing rules, in order of precedence, are as follows:
1. If the record has a `canonical_spdi` value, use it to call the `to_canonical_variation` API. Extract the `canonical_variation.variation` object, which will represent the `Allele` for the input variation record.
2. If `subclass_type` is either `Genotype` or `Haplotype`, then use the `id` (ClinVar variation id) as the input to get a VRS `Text` object for this record.
3. If the `absolute_copies` value is not null, use the `hgvs.nucleotide` value, if it is available, to call the `hgvs_to_copy_number_count` API. If the `hgvs.nucleotide` value is null, then use the `seq.derived_hgvs` value if it is available.
4. If `min_copies` and/or `max_copies` is not null, then use the `id` to create a `Text` variation.
5. If `variation_type` is 'Deletion', 'Duplication', 'copy number loss', or 'copy number gain', then use the `hgvs.nucleotide` expression to call the `hgvs_to_copy_number_change` API. If the `hgvs.nucleotide` value is null, use `seq.derived_hgvs` if it is not null.
6. If the record has an `hgvs.nucleotide` field, then use it to call the `to_vrs` API.
7. Otherwise, use the `id` value to create a `Text` variation.

Separate the different VRS variation type output objects into separate files (i.e. Allele, Text, CopyNumberCount, CopyNumberChange) in JSON format and store them in the same GCS bucket in gzip-compressed form.
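The precedence rules above can be sketched as a small classifier that returns the plan type and the input field used. This is a paraphrase of the rules for illustration, not the pilot's actual code, and the fallbacks where the ticket is silent (e.g. a copy-number record with no usable HGVS expression) are assumptions marked in comments.

```python
from typing import Optional, Tuple

CN_TYPES = ("Deletion", "Duplication", "copy number loss", "copy number gain")

def classify(rec: dict) -> Tuple[str, Optional[str]]:
    """Apply the precedence rules above; returns (vrs_type, input_field)."""
    hgvs = (rec.get("hgvs") or {}).get("nucleotide")
    derived = (rec.get("seq") or {}).get("derived_hgvs")
    if rec.get("canonical_spdi"):                       # rule 1
        return ("Allele", "canonical_spdi")
    if rec.get("subclass_type") in ("Genotype", "Haplotype"):  # rule 2
        return ("Text", "id")
    if rec.get("absolute_copies") is not None:          # rule 3
        if hgvs:
            return ("CopyNumberCount", "hgvs.nucleotide")
        if derived:
            return ("CopyNumberCount", "seq.derived_hgvs")
        return ("Text", "id")  # assumption: no usable HGVS -> Text
    if rec.get("min_copies") is not None or rec.get("max_copies") is not None:
        return ("Text", "id")                           # rule 4
    if rec.get("variation_type") in CN_TYPES:           # rule 5
        if hgvs:
            return ("CopyNumberChange", "hgvs.nucleotide")
        if derived:
            return ("CopyNumberChange", "seq.derived_hgvs")
        return ("Text", "id")  # assumption: no usable HGVS -> Text
    if hgvs:                                            # rule 6
        return ("Allele", "hgvs.nucleotide")
    return ("Text", "id")                               # rule 7
```

A classifier like this makes the precedence explicit and keeps the bucketing separate from the API calls, which matches the `vrs_xform_plan` idea described earlier in the thread.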