I've recently rewritten our pipeline for build 38, but haven't collected into nf/snakemake yet. I only post this here because some pieces may be relevant, e.g. the pitfalls from slivar for not annotating strings like VCFAnno can. Also several core tools were developed by you so I want to show some appreciation. I heavily looked into slivar as the end-all after getting my normalized VCF and dumping gemini/vcf2db, but I found lots of useful annotation tools within VEP that could then get filtered via gemini when querying from the db.

It's not perfect at all, but it works.

Current pipeline:

bwa mem, mapped to http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

bwa mem $BWA_INDEX \
        -t $NSLOTS \
        -R "@RG\tID:$SAMPLE_ID\tSM:$SAMPLE_ID\tPL:illumina" \
        -Y -K 100000000 \
        $FASTQR1 \
        $FASTQR2 \
        > $WORKING_DIR$SAMPLE_ID.sam

samtools sort+index

samtools view -@ $NSLOTS -ubS $WORKING_DIR${SAMPLE_ID}.sam \
    | samtools sort - -@ $NSLOTS  -T $WORKING_DIR${SAMPLE_ID}.sorted -O BAM -o $WORKING_DIR${SAMPLE_ID}.sorted.bam
samtools index $WORKING_DIR${SAMPLE_ID}.sorted.bam

Picard mark-dup (still needed), reindex


java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
    R=$BWA_INDEX \
    I=$WORKING_DIR${SAMPLE_ID}.sorted.bam \
    O=$WORKING_DIR${SAMPLE_ID}.dupremoved.sorted.bam \
    REMOVE_DUPLICATES=false \
    M=$WORKING_DIR${SAMPLE_ID}.duplicateMetrics.txt

samtools index $WORKING_DIR${SAMPLE_ID}.dupremoved.sorted.bam


4. DeepTrio --> GVCF --> GLNexus (parameterized by WGS/WES) --> BCFTools view

singularity run -B /usr/lib/locale/:/usr/lib/locale/ \ -B "${BAM_DIR}":"/bamdir" \ -B "${FASTA_DIR}":"/genomedir" \ -B "${OUTPUT_DIR}":"/output" \ docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \ /opt/deepvariant/bin/deeptrio/run_deeptrio \ --model_type=WGS \ --ref="/genomedir/$FASTA_FILE" \ --reads_child="/bamdir/$PROBAND_BAM" \ --reads_parent1="/bamdir/$FATHER_BAM" \ --reads_parent2="/bamdir/$MOTHER_BAM" \ --output_vcf_child="/output/$PROBAND_VCF" \ --output_vcf_parent1="/output/$FATHER_VCF" \ --output_vcf_parent2="/output/$MOTHER_VCF" \ --sample_name_child="$PROBAND_SAMPLEID" \ --sample_name_parent1="$FATHER_SAMPLEID" \ --sample_name_parent2="$MOTHER_SAMPLEID" \ --num_shards=$NSLOTS \ --intermediate_results_dir="/output/intermediate_results_dir" \ --output_gvcf_child="/output/$PROBAND_GVCF" \ --output_gvcf_parent1="/output/$FATHER_GVCF" \ --output_gvcf_parent2="/output/$MOTHER_GVCF"

/mnt/common/Precision/GLNexus/glnexus_cli -c DeepVariantWGS \ $PROBAND_GVCF \ $FATHER_GVCF \ $MOTHER_GVCF \ --threads $NSLOTS \

${FAMILY_ID}.glnexus.merged.bcf

bcftools view ${FAMILY_ID}.glnexus.merged.bcf | bgzip -c > ${FAMILY_ID}.glnexus.merged.vcf.gz


5. BCFTools norm + add filter

Normalize VCF

$BCFTOOLS norm \ -f $GENOME_FASTA \ --threads @NSLOTS \ -m - \ -O z \ --output $INVCF_NORM \ $INVCF

$TABIX -f $INVCF_NORM

Filter normalized VCF for high depth + low alt allele support

$BCFTOOLS filter \ --include 'FORMAT/AD[:1]>=5 && FORMAT/DP[] < 600' \ -m + \ -s + \ -O z \ --output $INVCF_NORM_FILTER \ $INVCF_NORM

$TABIX -f $INVCF_NORM_FILTER


7. Clean up RefCall

zgrep -v "RefCall" $INVCF_NORM_FILTER | bgzip -c > $INVCF_NORM_FILTER_NOREFCALL tabix $INVCF_NORM_FILTER_NOREFCALL


8. Slivar for rare filter, only keep passing variants (parameterized for your ideal AF/hom_alt cutoffs (huge speed up for the big boy coming next).

$SLIVAR/slivar.exe expr --vcf $INVCF_NORM_FILTER_NOREFCALL \ --ped $PED \ --pass-only \ -g $SLIVAR/gnomad.hg38.genomes.v3.fix.zip \ --info "INFO.gnomad_popmax_af < $AF_CUTOFF && INFO.gnomad_nhomalt < $HOM_ALT_CUTOFF && variant.FILTER == \"PASS\" && variant.ALT[0] != \"*\"" \ --js $SLIVAR/slivar/js/slivar-functions.js \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE


9. VEP! (It's all caps because it's intense, add some splicing plugins because splicing is key to diagnosis).

vep \ -i $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE \ --buffer_size 10000 \ --polyphen b\ --fasta $GENOME_FASTA \ --sift b \ --biotype \ --hgvs \ --protein \ --domains \ --pubmed \ --ccds \ --check_existing \ --nearest symbol \ --gene_phenotype \ --canonical \ --force_overwrite \ --offline \ --cache \ --dir_cache $CACHEDIR \ --dir_plugins $PLUGINSDIR \ --plugin MaxEntScan,$MAXENTSCANDIR \ --plugin SpliceAI,snv=$SPLICEAISNV,indel=$SPLICEAIINDEL \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO \ --compress_output bgzip \ --fork 20 \ --vcf


10. Split-VEP (this was a killer to figure out, but I'm happy I found it.

bcftools +split-vep \ -p vep \ -a CSQ \ -O z \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT \ -c MaxEntScan_alt:Float,\ MaxEntScan_diff:Float,\ MaxEntScan_ref:Float,\ SpliceAI_pred_DP_AG:Float,\ SpliceAI_pred_DP_AL:Float,\ SpliceAI_pred_DP_DG:Float,\ SpliceAI_pred_DP_DL:Float,\ SpliceAI_pred_DS_AG:Float,\ SpliceAI_pred_DS_AL:Float,\ SpliceAI_pred_DS_DG:Float,\ SpliceAI_pred_DS_DL:Float,\ SpliceAI_pred_SYMBOL:String,\ Existing_variation:String,\ Protein_position:String,\ DOMAINS:String,\ PUBMED:String,\ CLIN_SIG:String,\ -s worst \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO


11. VCFAnno Add the important fields like 2x CADD, some additional gnomAD, the latest clinvar that week, an in-house db.

$VCFANNO -lua $ANNOTVARDIR/VCFAnno/custom.lua \ -p $NSLOTS -base-path /mnt/common/DATABASES/REFERENCES/GRCh38/ \ $ANNOTVARDIR/VCFAnno/VCFAnno_Config_20210222_GPCC_NoSpliceAI_GRCh38.toml \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT > $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO


12. Fix the header for the VEP annotations.

sed -e 's/vepMaxEntScan_alt,Number=./vepMaxEntScan_alt,Number=1/g' \ -e 's/vepMaxEntScan_diff,Number=./vepMaxEntScan_diff,Number=1/g' \ -e 's/vepMaxEntScan_ref,Number=./vepMaxEntScan_ref,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AG,Number=./vepSpliceAI_pred_DP_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AL,Number=./vepSpliceAI_pred_DP_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DG,Number=./vepSpliceAI_pred_DP_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DL,Number=./vepSpliceAI_pred_DP_DL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AG,Number=./vepSpliceAI_pred_DS_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AL,Number=./vepSpliceAI_pred_DS_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DG,Number=./vepSpliceAI_pred_DS_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DL,Number=./vepSpliceAI_pred_DS_DL,Number=1/g' \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO \

$INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO_NUMBERFIX

VCF2DB

$VCF2DB \
--expand gt_quals --expand gt_depths --expand gt_alt_depths --expand gt_ref_depths --expand gt_types \
--a-ok AF --a-ok AC --a-ok AN \
--a-ok gnomad_genome_ac_global --a-ok gnomad_genome_af_global --a-ok gnomad_genome_hom_global --a-ok VAF \
--a-ok cosmic_count_observed --a-ok gnomad_genome_an_global --a-ok gnomad_nhomalt \
$INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO_NUMBERFIX $PED $GEMINIDB

Gemini queries across inheritance patterns (that's a lot of calls within a shell script, but nice because it can handle all inheritance patterns automagically).
Some finalizing scripts in python for basic table reshuffling, adding HPO/OMIM/MESHOPs tabs based on gene, and then an Excel macro to bring it to a close for a fancily formatted excel spreadsheet.

brentp / rare-disease-wf

Thoughts from a recent implementation #2

Normalize VCF

Filter normalized VCF for high depth + low alt allele support