LiuzLab / AI_MARRVEL

AI-MARRVEL (AIM) is an AI system for rare genetic disorder diagnosis
GNU General Public License v3.0

Support for gVCF Conversion and Index Building in Nextflow Pipeline #56

Closed hyunhwan-bcm closed 1 month ago

hyunhwan-bcm commented 1 month ago

Summary:

We propose adding support for gVCF files in our Nextflow pipeline. This feature includes building a reference genome index (for both hg19 and hg38) and converting gVCF files to VCF format when <NON_REF> tags are encountered - I believe these tags are specific to gVCF files produced by GATK. The conversion will be part of a new VCF_PRE_PROCESS process, ensuring streamlined integration into the pipeline.

Tasks:

  1. Build Reference Index for hg19 and hg38:

    • Either use pre-built reference indexes or build them during the pipeline's execution (only once).
    • Here’s a basic script for building the hg19 index (hg38 follows the same recipe with the corresponding download URL):
      #!/bin/bash
      set -e
      # Download hg19 (for hg38, use .../goldenPath/hg38/bigZips/hg38.fa.gz)
      wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
      gunzip hg19.fa.gz
      # Strip the "chr" prefix so contig names match the numeric convention
      sed 's/>chr/>/g' hg19.fa > num_prefix_hg19.fa
      # Keep only the primary contigs
      samtools faidx num_prefix_hg19.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M > final_hg19.fa
      # Index the trimmed FASTA and build the sequence dictionary GATK needs
      samtools faidx final_hg19.fa
      gatk CreateSequenceDictionary -R final_hg19.fa
  2. gVCF to VCF Conversion:

    • Check if the input VCF contains <NON_REF> entries within the first 10,000 lines.
    • If found, convert gVCF to VCF using the following steps:
      • Rename chromosomes and remove the ID field.
      • Sort the VCF.
      • Use GATK's GenotypeGVCFs to process the file.
      • Apply VariantFiltration to filter out low-quality variants.
    • The following bash script demonstrates the conversion process:
      
      #!/bin/bash
      set -e

      # Log file and start time (referenced below)
      start_time=$(date +%s)
      log_file="gvcf2vcf.log"

      # Define input/output paths and reference genome
      chrmap_file="/home/hwan/aim_data_dependencies_04_24_2024_sasi/bcf_annotate/chrmap.txt"
      reference_genome="./ref/final_hg19.fa"
      input_file="$1"
      output_file="${input_file%.vcf.gz}.no_g.vcf.gz"

      # Step 0: Check for <NON_REF> entries in the first 10,000 lines
      zcat "$input_file" | head -n 10000 | grep -q "<NON_REF>" || { echo "It's not gVCF"; exit 1; }

      # Step 1: Annotate and remove ID field
      echo "Step 1: Annotating and removing ID field..." | tee -a "$log_file"
      bcftools annotate --rename-chrs "$chrmap_file" -x ID "$input_file" -Oz -o step1.vcf.gz 2>> "$log_file"
      tabix step1.vcf.gz
      bcftools view -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y step1.vcf.gz -Oz -o step1_1.vcf.gz

      # Step 2: Sort the VCF file
      echo "Step 2: Sorting the VCF file..." | tee -a "$log_file"
      bcftools sort step1_1.vcf.gz -Oz -o step2.vcf.gz 2>> "$log_file"

      # Step 2.1: Index step2.vcf.gz with tabix
      echo "Indexing step2.vcf.gz with tabix..." | tee -a "$log_file"
      tabix -p vcf step2.vcf.gz 2>> "$log_file"

      # Step 3: Genotype GVCFs with GATK
      echo "Step 3: Running GenotypeGVCFs with GATK..." | tee -a "$log_file"
      gatk GenotypeGVCFs -R "$reference_genome" -V step2.vcf.gz -O step3.vcf.gz --allow-old-rms-mapping-quality-annotation-data 2>> "$log_file"

      # Step 4: Filter based on quality with GATK
      echo "Step 4: Running VariantFiltration with GATK..." | tee -a "$log_file"
      gatk VariantFiltration -V step3.vcf.gz -O step4.vcf.gz --filter-expression "QUAL < 30.0" --filter-name "LowQual" 2>> "$log_file"

      # Rename the final output file
      echo "Renaming the final output file..." | tee -a "$log_file"
      mv step4.vcf.gz "$output_file"

      # Index the final output using tabix
      echo "Indexing the final output with tabix..." | tee -a "$log_file"
      tabix -p vcf "$output_file" 2>> "$log_file"

      # Display the number of non-comment lines (ignore lines starting with #)
      lines=$(zcat "$output_file" | grep -v '^#' | wc -l)
      echo "File: $output_file has $lines lines (excluding comment lines)." | tee -a "$log_file"

      # Clean up intermediate files
      echo "Cleaning up intermediate files..." | tee -a "$log_file"
      rm -f step1.vcf.gz* step1_1.vcf.gz* step2.vcf.gz* step3.vcf.gz*

      # Calculate and print the running time
      end_time=$(date +%s)
      runtime=$((end_time - start_time))
      echo "Script completed in $runtime seconds." | tee -a "$log_file"

  3. Incorporation into VCF_PRE_PROCESS:

    • The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.
    • Ensure efficient handling of large datasets and minimize redundant computation.
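
The detection step from Task 2 can be sketched as a small gate run just before (or at the start of) VCF_PRE_PROCESS. This is a minimal sketch: `convert_gvcf.sh` is a hypothetical name standing in for the Task 2 conversion script.

```shell
#!/bin/bash
# Gate sketch: run the gVCF-to-VCF conversion only when <NON_REF>
# appears in the first 10,000 lines of the input.
# "convert_gvcf.sh" is a hypothetical name for the Task 2 script.

is_gvcf() {
  zcat "$1" 2>/dev/null | head -n 10000 | grep -q "<NON_REF>"
}

input_file="$1"
if is_gvcf "$input_file"; then
  echo "gVCF detected: converting before VCF_PRE_PROCESS"
  # ./convert_gvcf.sh "$input_file"
else
  echo "plain VCF: passing through unchanged"
fi
```

Keeping the check in a function makes it easy to reuse the same test both in a standalone script and inside the Nextflow process's script block.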

Benefits:

  • Streamlines the conversion of gVCF to VCF files while preserving necessary data.

Additional Information:

  • The reference genome index needs to be built only once per workflow.
  • Please ensure that all file paths and environment dependencies are correctly defined before integrating the scripts.


jylee-bcm commented 1 month ago

Hi, @hyunhwan-bcm can you also review the other options listed here: https://chatgpt.com/share/43eb88c5-9850-49a7-a0db-793289daea0f

I think it would be great if we could use vcftools or bcftools options, so we don't need to install anything extra and we wouldn't need to fetch reference files.
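
For reference, bcftools does ship a built-in gVCF mode, `bcftools convert --gvcf2vcf`, which would remove the GATK dependency; note, though, that it still requires the reference FASTA, so it only addresses half of the wish. A hedged sketch, with illustrative file names:

```shell
# Hypothetical wrapper around bcftools' built-in gVCF expansion.
# NOTE: --gvcf2vcf still needs --fasta-ref, so this drops the GATK
# dependency but not the reference-file requirement.
convert_with_bcftools() {
  local in_gvcf="$1" ref_fa="$2"
  local out_vcf="${in_gvcf%.g.vcf.gz}.no_g.vcf.gz"
  bcftools convert --gvcf2vcf --fasta-ref "$ref_fa" \
      -Oz -o "$out_vcf" "$in_gvcf"
  tabix -p vcf "$out_vcf"
}

# Example (illustrative paths):
# convert_with_bcftools sample.g.vcf.gz ref/final_hg19.fa
```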

jylee-bcm commented 1 month ago

Sorry, but I'm having difficulty understanding the meaning of Task 2 and Task 3.

It seems Task 2 already incorporates some of the logic of VCF_PRE_PROCESS, but some parts are missing too.

These are included:

  1. bcftools annotate --rename-chrs ... command

These are missing:

  1. bcftools annotate --set-id +'%CHROM\\_%POS\\_%REF\\_%FIRST_ALT' command
  2. bcftools filter ${params.run_id}-add-id.vcf.gz -i'FILTER == "PASS"'

Is this intentional because they are already covered by the new commands, or is it just a mistake by ChatGPT?
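
For concreteness, the two flagged commands would chain after the converted output roughly like this. This is only a sketch; the sample file names are illustrative, and `run_id` stands in for the pipeline's `${params.run_id}`.

```shell
# Sketch of the two VCF_PRE_PROCESS steps flagged as missing, chained
# after the converted VCF (file names here are illustrative).
add_id_and_filter() {
  local in_vcf="$1" run_id="$2"
  # Set a deterministic ID of the form CHROM_POS_REF_ALT
  bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' \
      "$in_vcf" -Oz -o "${run_id}-add-id.vcf.gz"
  # Keep only PASS records
  bcftools filter "${run_id}-add-id.vcf.gz" -i'FILTER == "PASS"' \
      -Oz -o "${run_id}-pass.vcf.gz"
}
```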

hyunhwan-bcm commented 1 month ago

This needs to be executed before VCF_PRE_PROCESS, and the bcftools annotate --rename-chrs step is required so we can drastically reduce the many non-variant records. After that, we will proceed with the other two parts.

jylee-bcm commented 1 month ago

The gVCF to VCF conversion logic will be added to the VCF_PRE_PROCESS stage.

Then this is irrelevant and safe to ignore, right?