Purpose/implementation Section

What feature is being added or bug is being addressed?

This PR adds a script that filters the parsed VCF file to retain one gene annotation row per variant. The resulting data frame is then merged with AutoGVP output to create the final comprehensive and abridged outputs.

What was your approach?

04-filter_gene_annotations.R performs the following:

Separates the CSQ column of the parsed VCF file so that subfields are split into columns (separate_wider_delim) and gene/transcript annotations are split into rows (separate_longer_delim).
Uses the PICK column to retain a single gene annotation row per variant (chosen based on canonical transcript status, transcript support level, transcript type, highest-impact gene consequence, etc.).
Merges parsed vcf with output from 01-annotate_variants_CAVATICA.R or 01-annotate_variants_custom.R
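
For reviewers who want a quick picture of the three steps above, here is a minimal sketch on toy data. It is not the code in 04-filter_gene_annotations.R. The column names (vcf_id, final_call), the CSQ subfield list and its order, the delimiters ("," between annotations, "|" between subfields), and the use of a left join are all assumptions made for illustration; in real data the CSQ subfield order comes from the VCF header.

```r
library(dplyr)
library(tidyr)

# Toy parsed VCF (hypothetical): one row per variant, with CSQ holding
# comma-separated annotations, each made of pipe-separated subfields.
vcf <- tibble(
  vcf_id = c("chr1-100-A-G", "chr2-200-C-T"),
  CSQ = c(
    "G|missense_variant|GENE1|ENST01|1,G|intron_variant|GENE2|ENST02|",
    "T|synonymous_variant|GENE3|ENST03|1"
  )
)

# Assumed subset and order of CSQ subfields, for illustration only.
csq_fields <- c("Allele", "Consequence", "SYMBOL", "Feature", "PICK")

vcf_filtered <- vcf %>%
  # One row per gene/transcript annotation
  separate_longer_delim(CSQ, delim = ",") %>%
  # One column per CSQ subfield
  separate_wider_delim(CSQ, delim = "|", names = csq_fields) %>%
  # Keep only the annotation flagged by PICK
  filter(PICK == "1")

# Toy stand-in for AutoGVP output (hypothetical columns).
autogvp <- tibble(
  vcf_id = c("chr1-100-A-G", "chr2-200-C-T"),
  final_call = c("Uncertain_significance", "Benign")
)

# Join type is an assumption; the script may merge differently.
merged <- autogvp %>%
  left_join(vcf_filtered, by = "vcf_id")
```

In this toy input each variant has exactly one annotation with PICK set, so the filter leaves one row per variant before the merge.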

What GitHub issue does your pull request address?

81

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please run script on test data as follows and review output:

Rscript 04-filter_gene_annotations.R --vcf input/test_pbta_filtered_parsed_vcf.tsv \
--autogvp input/test_pbta.cavatica_input.annotations_report.abridged.tsv \
--output "test_cavatica_pbta"

Please also review the code used to parse the CSQ column and to select unique gene annotations.

Is there anything that you want to discuss further?

This script should be robust whenever the VEP CSQ field is present, but it still needs to be tested on additional data sets.

Documentation Checklist

[X] The function has examples to showcase the usage

diskin-lab-chop / AutoGVP

Add gene annotation filtering, final output script #99