broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
https://broadinstitute.github.io/gatk-sv/
BSD 3-Clause "New" or "Revised" License
171 stars 72 forks source link

Use GATK to retrieve VCF records in JoinContigFromRemoteVcfs #52

Open mwalker174 opened 4 years ago

mwalker174 commented 4 years ago

This almost-working code can replace remote tabix, but there are a few issues:

  1. Downstream scripts expect empty format fields to be present (as ".") but GATK removes them
  2. duplicate records are problematic for bcftools annotate
  3. fields should only be replaced when missing (depth-only calls)
    touch subsetted_vcfs.list
    paste ~{write_lines(batches)} ~{write_lines(vcfs)} | while read BATCH VCF_PATH; do
      java -Xmx~{java_mem_mb}M -jar ${GATK_JAR} SelectVariants \
        -V "${VCF_PATH}" \
        -L "~{contig}" \
        -O "tmp.vcf.gz"

      # GATK removed empty FORMAT fields
      bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t[.\t]\n" tmp.vcf.gz | bgzip -c > ann.tab.gz
      tabix -s1 -b2 -e2 ann.tab.gz
      bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/PE_GT tmp.vcf.gz \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/PE_GQ \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/SR_GT \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/SR_GQ \
        | sed "s/AN=[0-9]*;//g" \
        | sed "s/AC=[0-9]*;//g" \
        | bgzip -c \
        > $BATCH.~{contig}.vcf.gz
      rm tmp.vcf.gz ann.tab.gz
      tabix $BATCH.~{contig}.vcf.gz
      echo "$BATCH.~{contig}.vcf.gz" >> subsetted_vcfs.list
    done
jingydz commented 3 years ago

${GATK_JAR} Hello, I would like to ask what the version of '${GATK_JAR}' should be?

mwalker174 commented 2 years ago

Hello, I would like to ask what the version of '${GATK_JAR}' should be?

I believe I was working with any version >=4.1, but >=4.0 may be enough