brentp / rare-disease-wf

(WIP) best-practices workflow for rare disease
MIT License
58 stars 8 forks source link

Thoughts from a recent implementation #2

Open Phillip-a-richmond opened 3 years ago

Phillip-a-richmond commented 3 years ago

I've recently rewritten our pipeline for build 38, but haven't collected into nf/snakemake yet. I only post this here because some pieces may be relevant, e.g. the pitfalls from slivar for not annotating strings like VCFAnno can. Also several core tools were developed by you so I want to show some appreciation. I heavily looked into slivar as the end-all after getting my normalized VCF and dumping gemini/vcf2db, but I found lots of useful annotation tools within VEP that could then get filtered via gemini when querying from the db.

It's not perfect at all, but it works.

Current pipeline:

  1. bwa mem, mapped to http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/
bwa mem $BWA_INDEX \
        -t $NSLOTS \
        -R "@RG\tID:$SAMPLE_ID\tSM:$SAMPLE_ID\tPL:illumina" \
        -Y -K 100000000 \
        $FASTQR1 \
        $FASTQR2 \
        > $WORKING_DIR$SAMPLE_ID.sam
  1. samtools sort+index

    samtools view -@ $NSLOTS -ubS $WORKING_DIR${SAMPLE_ID}.sam \
        | samtools sort - -@ $NSLOTS  -T $WORKING_DIR${SAMPLE_ID}.sorted -O BAM -o $WORKING_DIR${SAMPLE_ID}.sorted.bam
    samtools index $WORKING_DIR${SAMPLE_ID}.sorted.bam
  2. Picard mark-dup (still needed), reindex

    
    java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
        R=$BWA_INDEX \
        I=$WORKING_DIR${SAMPLE_ID}.sorted.bam \
        O=$WORKING_DIR${SAMPLE_ID}.dupremoved.sorted.bam \
        REMOVE_DUPLICATES=false \
        M=$WORKING_DIR${SAMPLE_ID}.duplicateMetrics.txt

samtools index $WORKING_DIR${SAMPLE_ID}.dupremoved.sorted.bam


4. DeepTrio --> GVCF --> GLNexus (parameterized by WGS/WES) --> BCFTools view

singularity run -B /usr/lib/locale/:/usr/lib/locale/ \ -B "${BAM_DIR}":"/bamdir" \ -B "${FASTA_DIR}":"/genomedir" \ -B "${OUTPUT_DIR}":"/output" \ docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \ /opt/deepvariant/bin/deeptrio/run_deeptrio \ --model_type=WGS \ --ref="/genomedir/$FASTA_FILE" \ --reads_child="/bamdir/$PROBAND_BAM" \ --reads_parent1="/bamdir/$FATHER_BAM" \ --reads_parent2="/bamdir/$MOTHER_BAM" \ --output_vcf_child="/output/$PROBAND_VCF" \ --output_vcf_parent1="/output/$FATHER_VCF" \ --output_vcf_parent2="/output/$MOTHER_VCF" \ --sample_name_child="$PROBAND_SAMPLEID" \ --sample_name_parent1="$FATHER_SAMPLEID" \ --sample_name_parent2="$MOTHER_SAMPLEID" \ --num_shards=$NSLOTS \ --intermediate_results_dir="/output/intermediate_results_dir" \ --output_gvcf_child="/output/$PROBAND_GVCF" \ --output_gvcf_parent1="/output/$FATHER_GVCF" \ --output_gvcf_parent2="/output/$MOTHER_GVCF"

/mnt/common/Precision/GLNexus/glnexus_cli -c DeepVariantWGS \ $PROBAND_GVCF \ $FATHER_GVCF \ $MOTHER_GVCF \ --threads $NSLOTS \

${FAMILY_ID}.glnexus.merged.bcf

bcftools view ${FAMILY_ID}.glnexus.merged.bcf | bgzip -c > ${FAMILY_ID}.glnexus.merged.vcf.gz


5. BCFTools norm + add filter

Normalize VCF

$BCFTOOLS norm \ -f $GENOME_FASTA \ --threads @NSLOTS \ -m - \ -O z \ --output $INVCF_NORM \ $INVCF

$TABIX -f $INVCF_NORM

Filter normalized VCF for high depth + low alt allele support

$BCFTOOLS filter \ --include 'FORMAT/AD[:1]>=5 && FORMAT/DP[] < 600' \ -m + \ -s + \ -O z \ --output $INVCF_NORM_FILTER \ $INVCF_NORM

$TABIX -f $INVCF_NORM_FILTER


7. Clean up RefCall

zgrep -v "RefCall" $INVCF_NORM_FILTER | bgzip -c > $INVCF_NORM_FILTER_NOREFCALL tabix $INVCF_NORM_FILTER_NOREFCALL


8. Slivar for rare filter, only keep passing variants (parameterized for your ideal AF/hom_alt cutoffs (huge speed up for the big boy coming next).   

$SLIVAR/slivar.exe expr --vcf $INVCF_NORM_FILTER_NOREFCALL \ --ped $PED \ --pass-only \ -g $SLIVAR/gnomad.hg38.genomes.v3.fix.zip \ --info "INFO.gnomad_popmax_af < $AF_CUTOFF && INFO.gnomad_nhomalt < $HOM_ALT_CUTOFF && variant.FILTER == \"PASS\" && variant.ALT[0] != \"*\"" \ --js $SLIVAR/slivar/js/slivar-functions.js \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE


9. VEP! (It's all caps because it's intense, add some splicing plugins because splicing is key to diagnosis). 

vep \ -i $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE \ --buffer_size 10000 \ --polyphen b\ --fasta $GENOME_FASTA \ --sift b \ --biotype \ --hgvs \ --protein \ --domains \ --pubmed \ --ccds \ --check_existing \ --nearest symbol \ --gene_phenotype \ --canonical \ --force_overwrite \ --offline \ --cache \ --dir_cache $CACHEDIR \ --dir_plugins $PLUGINSDIR \ --plugin MaxEntScan,$MAXENTSCANDIR \ --plugin SpliceAI,snv=$SPLICEAISNV,indel=$SPLICEAIINDEL \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO \ --compress_output bgzip \ --fork 20 \ --vcf


10. Split-VEP (this was a killer to figure out, but I'm happy I found it.

bcftools +split-vep \ -p vep \ -a CSQ \ -O z \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT \ -c MaxEntScan_alt:Float,\ MaxEntScan_diff:Float,\ MaxEntScan_ref:Float,\ SpliceAI_pred_DP_AG:Float,\ SpliceAI_pred_DP_AL:Float,\ SpliceAI_pred_DP_DG:Float,\ SpliceAI_pred_DP_DL:Float,\ SpliceAI_pred_DS_AG:Float,\ SpliceAI_pred_DS_AL:Float,\ SpliceAI_pred_DS_DG:Float,\ SpliceAI_pred_DS_DL:Float,\ SpliceAI_pred_SYMBOL:String,\ Existing_variation:String,\ Protein_position:String,\ DOMAINS:String,\ PUBMED:String,\ CLIN_SIG:String,\ -s worst \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO


11. VCFAnno Add the important fields like 2x CADD, some additional gnomAD, the latest clinvar that week, an in-house db.

$VCFANNO -lua $ANNOTVARDIR/VCFAnno/custom.lua \ -p $NSLOTS -base-path /mnt/common/DATABASES/REFERENCES/GRCh38/ \ $ANNOTVARDIR/VCFAnno/VCFAnno_Config_20210222_GPCC_NoSpliceAI_GRCh38.toml \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT > $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO


12. Fix the header for the VEP annotations.

sed -e 's/vepMaxEntScan_alt,Number=./vepMaxEntScan_alt,Number=1/g' \ -e 's/vepMaxEntScan_diff,Number=./vepMaxEntScan_diff,Number=1/g' \ -e 's/vepMaxEntScan_ref,Number=./vepMaxEntScan_ref,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AG,Number=./vepSpliceAI_pred_DP_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AL,Number=./vepSpliceAI_pred_DP_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DG,Number=./vepSpliceAI_pred_DP_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DL,Number=./vepSpliceAI_pred_DP_DL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AG,Number=./vepSpliceAI_pred_DS_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AL,Number=./vepSpliceAI_pred_DS_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DG,Number=./vepSpliceAI_pred_DS_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DL,Number=./vepSpliceAI_pred_DS_DL,Number=1/g' \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO \

$INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO_NUMBERFIX

  1. VCF2DB

    $VCF2DB \
    --expand gt_quals --expand gt_depths --expand gt_alt_depths --expand gt_ref_depths --expand gt_types \
    --a-ok AF --a-ok AC --a-ok AN \
    --a-ok gnomad_genome_ac_global --a-ok gnomad_genome_af_global --a-ok gnomad_genome_hom_global --a-ok VAF \
    --a-ok cosmic_count_observed --a-ok gnomad_genome_an_global --a-ok gnomad_nhomalt \
    $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO_NUMBERFIX $PED $GEMINIDB
  2. Gemini queries across inheritance patterns (that's a lot of calls within a shell script, but nice because it can handle all inheritance patterns automagically).

  3. Some finalizing scripts in python for basic table reshuffling, adding HPO/OMIM/MESHOPs tabs based on gene, and then an Excel macro to bring it to a close for a fancily formatted excel spreadsheet.

brentp commented 3 years ago

wow. thanks for sharing this. I didn't know about +split-vep plugin. I am curious what's stopping you from using slivar as the final step. It doesn't populate strings by default, but you can have it do so with: export SLIVAR_FORMAT_STRINGS=1 Of course, I understand that vcf2db+gemini is just working for you, but want to resolve any shortcomings.

How much do the spliceAI and other annotations help? I'd like to collect the most effective annotations but haven't seen much actual help when filtering on spliceAI