diskin-lab-chop / AutoGVP


Parse VCF file so that INFO subfields are tab-separated columns #74

Closed rjcorb closed 1 year ago

rjcorb commented 1 year ago

Purpose/implementation Section

What feature is being added or bug is being addressed?

This PR adds parse_vcf.sh, a bash script that parses all columns and INFO subfields from a VCF file into tab-separated columns for downstream workflow output generation.

What was your approach?

parse_vcf.sh first extracts all INFO subfields and formats them into a single character string, which is used as the format-string input for bcftools query. bcftools query then parses all VCF columns and INFO subfields from the VCF file, and the output is written as *.parsed.vcf.
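The approach described above can be sketched roughly as follows. This is a minimal illustration, not the contents of parse_vcf.sh: the helper name build_format is hypothetical, and it assumes the INFO subfields are declared as `##INFO=<ID=...>` lines in the VCF header. It also skips the INFO part entirely when no subfields are found, so the format string never ends in a dangling `%`.

```shell
#!/usr/bin/env bash
# build_format: hypothetical helper (not taken from parse_vcf.sh).
# Prints a bcftools query format string covering the fixed VCF columns
# plus one %INFO/<ID> entry per INFO subfield declared in the header.
build_format() {
  local vcf="$1"
  local info_fields

  # Take the ID from each "##INFO=<ID=...," header line and join the
  # resulting %INFO/<ID> entries with a literal backslash-t, which
  # bcftools interprets as a tab.
  info_fields=$(awk -F'[=,>]' '
    /^##INFO=<ID=/ { printf "%s%%INFO/%s", sep, $3; sep = "\\t" }
  ' "$vcf")

  # Fixed VCF columns, written with literal \t separators.
  printf '%%CHROM\\t%%POS\\t%%ID\\t%%REF\\t%%ALT\\t%%QUAL\\t%%FILTER'

  # Append INFO entries only if any were found.
  if [ -n "$info_fields" ]; then
    printf '\\t%s' "$info_fields"
  fi

  # Literal \n so that bcftools ends each record with a newline.
  printf '\\n'
}

# Example use (requires bcftools on PATH; -H adds a header row):
#   bcftools query -H -f "$(build_format input/test-parsing.vcf)" \
#     input/test-parsing.vcf > test-parsing.parsed.vcf
```

Reading the IDs from the header rather than from the records means every declared subfield gets a column, even if it is missing from some variants.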

What GitHub issue does your pull request address?

#73

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Please run the script on the test input VCF file as follows:

bash parse_vcf.sh input/test-parsing.vcf

Which areas should receive a particularly close look?

Ensure that the script runs and that the output looks as expected (tab-delimited, one column per INFO subfield).
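One quick way to check the tab-delimited layout is to confirm that every row has the same number of tab-separated fields as the header row. The sketch below is only a suggestion for reviewers; check_columns is a hypothetical helper, and the output file name in the usage comment is an assumption based on the *.parsed.vcf naming described above.

```shell
#!/usr/bin/env bash
# check_columns: hypothetical helper for reviewing the parsed output.
# Exits non-zero and reports each line whose tab-separated field count
# differs from that of the first line.
check_columns() {
  awk -F'\t' '
    NR == 1 { n = NF; next }
    NF != n {
      printf "line %d has %d columns, expected %d\n", NR, NF, n
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example use (file name is an assumption):
#   check_columns test-parsing.parsed.vcf && echo "column counts are consistent"
```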

Is there anything that you want to discuss further?

The column names in the output file currently start with [<column no.>]. I think this would be easier to remove in a subsequent R script, in which this file will be merged with the AutoGVP Rscript output.
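For reference, if the prefixes ever needed to be removed in the shell step instead, a sed filter along these lines would probably do it. This is a sketch only: strip_header_indices is a hypothetical helper, and it assumes the header is the first line of the file, with an optional leading "#" and each column name prefixed by a bracketed index.

```shell
#!/usr/bin/env bash
# strip_header_indices: hypothetical helper, reads stdin and writes stdout.
# Only line 1 is modified: a leading "#" (and optional space) is dropped,
# then every "[N]" index is removed. Data lines pass through unchanged.
strip_header_indices() {
  sed -E -e '1s/^# ?//' -e '1s/\[[0-9]+\]//g'
}

# Example use (file names are assumptions):
#   strip_header_indices < test-parsing.parsed.vcf > test-parsing.clean.tsv
```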

Documentation Checklist

naqvia commented 1 year ago

I tried to run the script, but I am getting the following error: Could not parse format string: %CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%\n

It is pointing to line 24. The column headers do match, so I am not sure what's going on...

rjcorb commented 1 year ago

Should run now

naqvia commented 1 year ago

Yes, that worked!