LiuzLab / AI_MARRVEL

AI-MARRVEL (AIM) is an AI system for rare genetic disorder diagnosis
GNU General Public License v3.0

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Closed hyunhwan-bcm closed 1 month ago

hyunhwan-bcm commented 1 month ago

Summary:

We propose adding support for gVCF files in our Nextflow pipeline. This feature includes building a reference genome index (for both hg19 and hg38) and converting gVCF files to VCF format when <NON_REF> tags are encountered - I believe these tags are specific to gVCF files produced by GATK. The conversion will be part of a new VCF_PRE_PROCESS process, ensuring streamlined integration into the pipeline.

Tasks:

  1. Build Reference Index for hg19 and hg38:

    • Either use pre-built reference indexes or build them during the pipeline's execution (only once).
    • Here’s a basic script for building the hg19 index (hg38 follows the same recipe with the corresponding download URL):
      #!/bin/bash
      set -e
      # Download hg19 (for hg38, use .../goldenPath/hg38/bigZips/hg38.fa.gz)
      wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
      gunzip hg19.fa.gz
      # Strip the "chr" prefix so contig names match the numeric convention
      sed 's/>chr/>/g' hg19.fa > num_prefix_hg19.fa
      # Keep only the primary contigs
      samtools faidx num_prefix_hg19.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M > final_hg19.fa
      # Index the trimmed FASTA and build the sequence dictionary GATK needs
      samtools faidx final_hg19.fa
      gatk CreateSequenceDictionary -R final_hg19.fa
  2. gVCF to VCF Conversion:

    • Check if the input VCF contains <NON_REF> entries within the first 10,000 lines.
    • If found, convert gVCF to VCF using the following steps:
      • Rename chromosomes and remove the ID field.
      • Sort the VCF.
      • Use GATK's GenotypeGVCFs to process the file.
      • Apply VariantFiltration to filter out low-quality variants.
    • The following bash script demonstrates the conversion process:
      
      #!/bin/bash
      set -e

      # Log file and start time (referenced below)
      start_time=$(date +%s)
      log_file="gvcf2vcf.log"

      # Define input/output paths and reference genome
      chrmap_file="/home/hwan/aim_data_dependencies_04_24_2024_sasi/bcf_annotate/chrmap.txt"
      reference_genome="./ref/final_hg19.fa"
      input_file="$1"
      output_file="${input_file%.vcf.gz}.no_g.vcf.gz"

      # Step 0: Check for <NON_REF> entries in the first 10,000 lines
      zcat "$input_file" | head -n 10000 | grep -q "<NON_REF>" || { echo "It's not gVCF"; exit 1; }

      # Step 1: Annotate and remove ID field
      echo "Step 1: Annotating and removing ID field..." | tee -a "$log_file"
      bcftools annotate --rename-chrs "$chrmap_file" -x ID "$input_file" -Oz -o step1.vcf.gz 2>> "$log_file"
      tabix step1.vcf.gz
      bcftools view -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y step1.vcf.gz -Oz -o step1_1.vcf.gz

      # Step 2: Sort the VCF file
      echo "Step 2: Sorting the VCF file..." | tee -a "$log_file"
      bcftools sort step1_1.vcf.gz -Oz -o step2.vcf.gz 2>> "$log_file"

      # Step 2.1: Index step2.vcf.gz with tabix
      echo "Indexing step2.vcf.gz with tabix..." | tee -a "$log_file"
      tabix -p vcf step2.vcf.gz 2>> "$log_file"

      # Step 3: Genotype GVCFs with GATK
      echo "Step 3: Running GenotypeGVCFs with GATK..." | tee -a "$log_file"
      gatk GenotypeGVCFs -R "$reference_genome" -V step2.vcf.gz -O step3.vcf.gz --allow-old-rms-mapping-quality-annotation-data 2>> "$log_file"

      # Step 4: Filter based on quality with GATK
      echo "Step 4: Running VariantFiltration with GATK..." | tee -a "$log_file"
      gatk VariantFiltration -V step3.vcf.gz -O step4.vcf.gz --filter-expression "QUAL < 30.0" --filter-name "LowQual" 2>> "$log_file"

      # Rename the final output file
      echo "Renaming the final output file..." | tee -a "$log_file"
      mv step4.vcf.gz "$output_file"

      # Index the final output using tabix
      echo "Indexing the final output with tabix..." | tee -a "$log_file"
      tabix -p vcf "$output_file" 2>> "$log_file"

      # Display the number of non-comment lines (ignore lines starting with #)
      lines=$(zcat "$output_file" | grep -v '^#' | wc -l)
      echo "File: $output_file has $lines lines (excluding comment lines)." | tee -a "$log_file"

      # Clean up intermediate files
      echo "Cleaning up intermediate files..." | tee -a "$log_file"
      rm -f step1.vcf.gz* step1_1.vcf.gz* step2.vcf.gz* step3.vcf.gz*

      # Calculate and print the running time
      end_time=$(date +%s)
      runtime=$((end_time - start_time))
      echo "Script completed in $runtime seconds." | tee -a "$log_file"

  3. Incorporation into VCF_PRE_PROCESS:

    • The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.
    • Ensure efficient handling of large datasets and minimize redundant computation.
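
The detection step from Task 2 can be sketched as a small gate run just before (or at the start of) VCF_PRE_PROCESS. This is a minimal sketch: `convert_gvcf.sh` is a hypothetical name standing in for the Task 2 conversion script.

```shell
#!/bin/bash
# Gate sketch: run the gVCF-to-VCF conversion only when <NON_REF>
# appears in the first 10,000 lines of the input.
# "convert_gvcf.sh" is a hypothetical name for the Task 2 script.

is_gvcf() {
  zcat "$1" 2>/dev/null | head -n 10000 | grep -q "<NON_REF>"
}

input_file="$1"
if is_gvcf "$input_file"; then
  echo "gVCF detected: converting before VCF_PRE_PROCESS"
  # ./convert_gvcf.sh "$input_file"
else
  echo "plain VCF: passing through unchanged"
fi
```

Keeping the check in a function makes it easy to reuse the same test both in a standalone script and inside the Nextflow process's script block.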

Benefits:

  • Streamlines the conversion of gVCF to VCF files while preserving necessary data.

Additional Information:

  • The reference genome index needs to be built only once per workflow.
  • Please ensure that all file paths and environment dependencies are correctly defined before integrating the scripts.


jylee-bcm commented 1 month ago

Hi, @hyunhwan-bcm can you also review the other options listed here: https://chatgpt.com/share/43eb88c5-9850-49a7-a0db-793289daea0f

I think it would be great if we could use vcftools or bcftools options, so we don't need to install anything extra and we wouldn't need to fetch reference files.
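
For reference, bcftools does ship a built-in gVCF mode, `bcftools convert --gvcf2vcf`, which would remove the GATK dependency; note, though, that it still requires the reference FASTA, so it only addresses half of the wish. A hedged sketch, with illustrative file names:

```shell
# Hypothetical wrapper around bcftools' built-in gVCF expansion.
# NOTE: --gvcf2vcf still needs --fasta-ref, so this drops the GATK
# dependency but not the reference-file requirement.
convert_with_bcftools() {
  local in_gvcf="$1" ref_fa="$2"
  local out_vcf="${in_gvcf%.g.vcf.gz}.no_g.vcf.gz"
  bcftools convert --gvcf2vcf --fasta-ref "$ref_fa" \
      -Oz -o "$out_vcf" "$in_gvcf"
  tabix -p vcf "$out_vcf"
}

# Example (illustrative paths):
# convert_with_bcftools sample.g.vcf.gz ref/final_hg19.fa
```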

jylee-bcm commented 1 month ago

Sorry, but I'm having difficulty understanding the meaning of Task 2 and Task 3.

It seems Task 2 already incorporates some of the logic of VCF_PRE_PROCESS, but some parts are missing too.

These are included:

  1. bcftools annotate --rename-chrs ... command

These are missing:

  1. bcftools annotate --set-id +'%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' command
  2. bcftools filter ${params.run_id}-add-id.vcf.gz -i'FILTER == "PASS"'

Is this intentional because they are already covered by the new commands, or is it just a mistake by ChatGPT?
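
For concreteness, the two flagged commands would chain after the converted output roughly like this. This is only a sketch; the sample file names are illustrative, and `run_id` stands in for the pipeline's `${params.run_id}`.

```shell
# Sketch of the two VCF_PRE_PROCESS steps flagged as missing, chained
# after the converted VCF (file names here are illustrative).
add_id_and_filter() {
  local in_vcf="$1" run_id="$2"
  # Set a deterministic ID of the form CHROM_POS_REF_ALT
  bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' \
      "$in_vcf" -Oz -o "${run_id}-add-id.vcf.gz"
  # Keep only PASS records
  bcftools filter "${run_id}-add-id.vcf.gz" -i'FILTER == "PASS"' \
      -Oz -o "${run_id}-pass.vcf.gz"
}
```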

hyunhwan-bcm commented 1 month ago

This needs to be executed before VCF_PRE_PROCESS, and the bcftools annotate --rename-chrs step is required so we can drastically reduce the many non-variant records. After that, we will proceed with the other two parts.

jylee-bcm commented 1 month ago

The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.

Then this is irrelevant and safe to ignore, right?