diskin-lab-chop / AutoGVP

Add script to select clinVar variant submissions #105

Closed rjcorb closed 1 year ago

rjcorb commented 1 year ago

Purpose/implementation Section

What feature is being added or bug is being addressed?

This PR adds the select-clinVar-submissions.R script, which loads the ClinVar variant and submission summary files and selects unique submission calls per variant based on the criteria defined in the AutoGVP pathogenicity assessment workflow.

What was your approach?

Copied code from 01-annotate_variants_CAVATICA_input.R and 01-annotate_variants_custom_input.R into a new R script. The script output, ClinVar-selected-submissions.tsv, is subsequently loaded by AutoGVP to resolve variant calls. This allows ClinVar submission selection to be performed once by a user before running the AutoGVP script, instead of being repeated for every sample, which greatly reduces the AutoGVP runtime for each sample.
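To illustrate the run-once idea, here is a minimal dry-run sketch. It only prints the custom-workflow command for a sample, with every sample pointing at the same ClinVar-selected-submissions.tsv. The helper name autogvp_cmd and the per-sample file naming (copied from the test file names used in this PR) are assumptions for illustration, not part of AutoGVP.

```shell
# Hypothetical helper: print (do not run) the custom-workflow command for one sample.
# Assumption: per-sample inputs follow the naming of the test files in this PR
# (e.g. input/test_VEP.vcf, input/test_autopvs1.txt).
autogvp_cmd() {
  sample="$1"
  # The selected-submissions file is generated once and shared by all samples.
  selected="${2:-input/ClinVar-selected-submissions.tsv}"
  printf '%s\n' "Rscript 01-annotate_variants_custom_input.R \
--vcf input/${sample}_VEP.vcf \
--intervar input/${sample}_VEP.hg38_multianno.txt.intervar \
--multianno input/${sample}_VEP.vcf.hg38_multianno.txt \
--variant_summary ${selected} \
--autopvs1 input/${sample}_autopvs1.txt \
--clinvar input/clinvar.vcf.gz \
--output ${sample}_custom"
}
```

For example, `for s in sampleA sampleB; do autogvp_cmd "$s"; done` (sample names are placeholders) prints one command per sample, each reusing the same --variant_summary file.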

What GitHub issue does your pull request address?

#104

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please review the code in select-clinVar-submissions.R and run the script as follows:

Rscript select-clinVar-submissions.R --variant_summary input/variant_summary.txt.gz --submission_summary input/submission_summary.txt.gz

Please also run AutoGVP on test files using this new variant summary input as follows:

custom workflow:

Rscript 01-annotate_variants_custom_input.R --vcf input/test_VEP.vcf \
--intervar input/test_VEP.hg38_multianno.txt.intervar \
--multianno input/test_VEP.vcf.hg38_multianno.txt \
--variant_summary input/ClinVar-selected-submissions.tsv \
--autopvs1 input/test_autopvs1.txt \
--clinvar input/clinvar.vcf.gz \
--output "test_custom"

Cavatica workflow:

Rscript 01-annotate_variants_CAVATICA_input.R --vcf input/test-cavatica.single.vqsr.filtered.vep_105.vcf \
--intervar input/test-cavatica.hg38_multianno.txt.intervar \
--multianno input/test-cavatica.hg38_multianno.txt \
--variant_summary input/ClinVar-selected-submissions.tsv \
--autopvs1 input/test-cavatica.autopvs1.tsv \
--output "test_cavatica"

Is there anything that you want to discuss further?

We will have to include code in the wrapper script that checks whether this script has been run and its output file has been generated before running AutoGVP.
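A minimal sketch of what such a check could look like, assuming the wrapper is a shell script run from the AutoGVP/AutoGVP directory. The function names selection_ready and ensure_selection are hypothetical; the Rscript call and file paths are the ones given in this PR.

```shell
# Hypothetical helper: succeeds if the selected-submissions file exists and is non-empty.
selection_ready() {
  [ -s "$1" ]
}

# Hypothetical wrapper step: run the selection script only when its output is missing.
ensure_selection() {
  out_file="${1:-input/ClinVar-selected-submissions.tsv}"
  if selection_ready "$out_file"; then
    echo "Found $out_file; skipping submission selection."
  else
    echo "Missing $out_file; running select-clinVar-submissions.R first."
    Rscript select-clinVar-submissions.R \
      --variant_summary input/variant_summary.txt.gz \
      --submission_summary input/submission_summary.txt.gz
  fi
}
```

Calling `ensure_selection` at the top of the wrapper would keep the selection step out of the per-sample runs. Note that a non-empty file check does not detect a stale output built from older ClinVar files.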

Documentation Checklist

jharenza commented 1 year ago

If running from the AutoGVP/AutoGVP directory, the command should be:

Rscript select-clinVar-submissions.R --variant_summary input/variant_summary.txt.gz --submission_summary input/submission_summary.txt.gz

Note: the files are now gzipped.

rjcorb commented 1 year ago

@jharenza all suggestions have been implemented

jharenza commented 1 year ago

I do get a new md5sum with the updates (possibly because rows with an NA gene symbol are no longer removed), but I wanted to make sure this is expected.

old:

131c07f9fa363ba11cc00a728dc365a9  input/ClinVar-selected-submissions.tsv

new:

2df7e183e5892a2cb9ee7014dda7b4df  input/ClinVar-selected-submissions.tsv

rjcorb commented 1 year ago

@jharenza Yes, I can confirm that the changes in output are due to the inclusion of variants/submissions that were previously filtered out.
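One way to check that a checksum change like this reflects only newly included rows is to list the lines present in the new output but absent from the old one. This is a minimal sketch, assuming a copy of the previous output was kept; rows_only_in_new is an illustrative helper, not part of AutoGVP.

```shell
# Hypothetical helper: print lines that appear in the new file but not in the old one.
rows_only_in_new() {
  old_file="$1"
  new_file="$2"
  old_sorted="$(mktemp)"
  # comm requires both inputs sorted in the same collation order.
  LC_ALL=C sort "$old_file" > "$old_sorted"
  # -13 suppresses lines unique to the old file and lines common to both.
  LC_ALL=C sort "$new_file" | LC_ALL=C comm -13 "$old_sorted" -
  rm -f "$old_sorted"
}
```

For example, `rows_only_in_new old.tsv input/ClinVar-selected-submissions.tsv | wc -l` (where old.tsv is a placeholder for the saved previous output) counts the added rows. Swapping the two arguments shows rows that were dropped, which should be none if the change only adds previously filtered variants.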