
document and reproduce the precisionFDA pipeline #23

ekg commented 8 years ago

The basic idea is to make two sets of calls, one from global assembly of the input FASTQs (fermikit), and another from global alignment and variant calling (freebayes).

We run fermikit on the FASTQs, then take its raw VCF output as a list of candidates. (likely private: https://precision.fda.gov/apps/app-BvJPP100469368x7QvJkKG9Y/fork, but documented here)

https://github.com/lh3/fermikit
fermikit, by Heng Li, is a de novo assembly based variant calling pipeline for Illumina short reads. Instead of mapping reads to the reference genome, fermikit assembles the reads into uniquely assemblable contigs, or "unitigs", which preserve distinct haplotypes as far as they can be resolved unambiguously. fermikit then aligns these unitigs to the reference genome and calls both small variants and structural variants.

# Build the read-source command string ("zcat file1 file2 ...") that
# fermi2.pl expects as its reads argument
read_reads=$(find /work/in -type f -path "/work/in/reads*/*" | xargs -n 100 echo zcat)

# Generate the assembly Makefile and run it, producing unitigs in genome.mag.gz
/fermi.kit/fermi2.pl unitig "-s${estimated_genome_size}" -t$(nproc) "-l${read_length}" -p genome "$read_reads" > genome.mak
cat genome.mak  # log the generated Makefile
make -f genome.mak

# Prepare the reference and its BWA index, then align the unitigs and call
# variants; run-calling prints the pipeline commands, which sh executes
zcat /work/in/reference_genome/* > reference_genome.fa
samtools faidx reference_genome.fa
/fermi.kit/bwa index reference_genome.fa
/fermi.kit/run-calling -t$(nproc) reference_genome.fa genome.mag.gz | sh

ls -lh  # log the working directory contents

# Rename and emit the pipeline outputs
mv genome.srt.bam "${output_name}.bam"
emit unitigs_bam "${output_name}.bam"

mv genome.raw.vcf.gz "${output_name}.raw.vcf.gz"
emit raw_vcf "${output_name}.raw.vcf.gz"

mv genome.flt.vcf.gz "${output_name}.flt.vcf.gz"
emit filtered_vcf "${output_name}.flt.vcf.gz"

mv genome.sv.vcf.gz "${output_name}.sv.vcf.gz"
emit sv_vcf "${output_name}.sv.vcf.gz"

zip "${output_name}.fermikit.logs.zip" *.log
emit logs "${output_name}.fermikit.logs.zip"
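
The raw VCF emitted here is the candidate list that feeds the union step later on. Purely as an illustration, the candidate records can be eyeballed with standard tools:

zcat "${output_name}.raw.vcf.gz" | grep -v '^#' | head   # first few candidate calls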

We run bwa mem on the FASTQs and reference genome, then mark PCR duplicates in the alignments. (https://precision.fda.gov/apps/app-BpF9YGQ0bxjbjk6Fx1F0KJF0/fork)

This app maps Illumina FASTQ reads to the human reference genome (hs37d5 or GRCh37) using BWA-MEM. The mappings are then sorted, indexed, and duplicate-marked with bamsormadup from biobambam2.

# If no output name was given, construct one from the reads
if [[ "$output_name" == "" ]]; then
  output_name="$reads_prefix"
  # Reads are often called "something_1.fastq.gz"; remove "_1"
  output_name="${output_name%_1}"
  output_name="${output_name%_R1}"
fi
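
To illustrate the suffix stripping, assuming a hypothetical input named NA12878_R1.fastq.gz:

output_name=NA12878_R1            # hypothetical prefix derived from the input filename
output_name="${output_name%_1}"   # no trailing "_1", so unchanged
output_name="${output_name%_R1}"  # strips "_R1", leaving NA12878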

# Clean up the sample name, stripping any characters that are not safe
# to embed in the read group header
sample="${sample//[^-.0-9@A-Z_a-z]/}"

# If no read group was given, use same value as sample
if [[ "$read_group_id" == "" ]]; then
  read_group_id="$sample"
fi

# Set up BWA options
opts=()
opts+=("-t" "$(nproc)")
opts+=("-R" "@RG\tID:${read_group_id}\tPL:ILLUMINA\tSM:${sample}")
if [[ "$mark_as_secondary" == "true" ]]; then
  opts+=("-M")
fi
if [[ "$use_hs37d5" == "true" ]]; then
  opts+=("hs37d5.fa")
else
  opts+=("grch37.fa")
fi
opts+=("$reads_path")
if [[ "$reads2" != "" ]]; then
  opts+=("$reads2_path")
fi

bwa mem "${opts[@]}" \
  | /opt/biobambam2/bin/bamsormadup inputformat=sam \
      indexfilename="$output_name".bam.bai \
      M="$output_name".duplication_metrics \
  > "$output_name".bam

emit sorted_bam "$output_name".bam
emit sorted_bai "$output_name".bam.bai
emit duplication_metrics "$output_name".duplication_metrics
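
As a quick sanity check on the app's output (not part of the app itself, just illustrative), samtools can summarize the alignments and confirm the read group made it into the header:

samtools flagstat "$output_name".bam
samtools view -H "$output_name".bam | grep '^@RG'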

We then derive a VCF from this output using freebayes. This step is very fast, and it sometimes made sense to reconfigure it, so in practice I've run it inline with the HHGA generation, which is a bit messy. This is the pipeline I'd use:

ref=hs37d5.fa      # we're using GRCh37 plus the "decoy" sequences
region=22:100-200  # dummy region, for example
bam=sample.bam     # duplicate-marked BAM from the bwa mem step above
freebayes -f "$ref" --region "$region" --no-partial-observations --min-repeat-entropy 1 -C 2 -F 0.05 "$bam" \
    | vcffilter -f 'QUAL > 1e-8' | bgzip > freebayes.vcf.gz
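
Downstream steps slice this VCF by region, which requires a tabix index on the bgzipped file; assuming htslib's tabix is on the path:

tabix -p vcf freebayes.vcf.gz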

Work is parallelized over a list of regions that contain roughly equal numbers of alignments, computed from one of the Garvan NA12878 sequencing runs.

I made them like this:

bamprofile -f $ref -b aln.bam -e 1000000 >wg.e1M.regions
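
The per-region script below can then be fanned out over that region list, for example with GNU parallel (assuming the script is saved under the hypothetical name process_region.sh):

parallel -j "$(nproc)" ./process_region.sh {} < wg.e1M.regions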

This would be the process per region:

#!/bin/bash
# Process one region: call variants with freebayes, slice out the fermikit
# candidates, union the two call sets, and generate HHGA examples.

name=vial1l
ref=~/graphs/hs37d5.fa
fermikitvcf=NA12878-Garvan-Vial_1.fermikit.raw.vcf.gz
bam=NA12878-Garvan-Vial1.bam
region=$1

mkdir -p "$name"

# freebayes calls for this region
freebayes -f "$ref" --region "$region" \
         --no-partial-observations \
         --min-repeat-entropy 1 \
         -C 2 -F 0.05 "$bam" \
    | vcffilter -f 'QUAL > 1e-8' \
    | bgziptabix "$name/$region.freebayes.vcf.gz"

# fermikit candidates for this region
tabix -h "$fermikitvcf" "$region" | bgziptabix "$name/$region.fermikit.vcf.gz"

# union of the two call sets, tagging calls that came from fermikit
vcfjoincalls "$ref" "$name/$region.freebayes.vcf.gz" "$name/$region.fermikit.vcf.gz" fermikit "$name" \
    | bgziptabix "$name/$region.union.vcf.gz"

# HHGA example generation against the GiaB truth set
hhga_region \
    -r "$region" \
    -w 16 \
    -x 10 \
    -f "$ref" \
    -T giab_v2.19.vcf.gz \
    -B giab_callable_v2.19.bed.gz \
    -b "$bam" \
    -v "$name/$region.union.vcf.gz" \
    -o "$name" -S NA12878

The bgziptabix and vcfjoincalls scripts are in vcflib: https://github.com/vcflib/vcflib/blob/master/scripts/vcfjoincalls, https://github.com/vcflib/vcflib/blob/master/scripts/bgziptabix
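
For reference, bgziptabix just combines compression and indexing; its body is essentially the following (paraphrased from the vcflib script):

# bgziptabix out.vcf.gz: bgzip stdin to out.vcf.gz, then tabix-index it
bgzip -c > "$1"
tabix -p vcf "$1"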

hhga_region is in this repo: https://github.com/ekg/hhga/blob/master/scripts/hhga_region