Open Phillip-a-richmond opened 3 years ago
wow. thanks for sharing this. I didn't know about +split-vep
plugin.
I am curious what's stopping you from using slivar as the final step. It doesn't populate strings by default, but you can have it do so with: export SLIVAR_FORMAT_STRINGS=1
Of course, I understand that vcf2db+gemini is just working for you, but want to resolve any shortcomings.
How much do the spliceAI and other annotations help? I'd like to collect the most effective annotations but haven't seen much actual help when filtering on spliceAI
I've recently rewritten our pipeline for build 38, but haven't collected into nf/snakemake yet. I only post this here because some pieces may be relevant, e.g. the pitfalls from slivar for not annotating strings like VCFAnno can. Also several core tools were developed by you so I want to show some appreciation. I heavily looked into slivar as the end-all after getting my normalized VCF and dumping gemini/vcf2db, but I found lots of useful annotation tools within VEP that could then get filtered via gemini when querying from the db.
It's not perfect at all, but it works.
Current pipeline:
samtools sort+index
Picard mark-dup (still needed), reindex
samtools index $WORKING_DIR${SAMPLE_ID}.dupremoved.sorted.bam
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \ -B "${BAM_DIR}":"/bamdir" \ -B "${FASTA_DIR}":"/genomedir" \ -B "${OUTPUT_DIR}":"/output" \ docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \ /opt/deepvariant/bin/deeptrio/run_deeptrio \ --model_type=WGS \ --ref="/genomedir/$FASTA_FILE" \ --reads_child="/bamdir/$PROBAND_BAM" \ --reads_parent1="/bamdir/$FATHER_BAM" \ --reads_parent2="/bamdir/$MOTHER_BAM" \ --output_vcf_child="/output/$PROBAND_VCF" \ --output_vcf_parent1="/output/$FATHER_VCF" \ --output_vcf_parent2="/output/$MOTHER_VCF" \ --sample_name_child="$PROBAND_SAMPLEID" \ --sample_name_parent1="$FATHER_SAMPLEID" \ --sample_name_parent2="$MOTHER_SAMPLEID" \ --num_shards=$NSLOTS \ --intermediate_results_dir="/output/intermediate_results_dir" \ --output_gvcf_child="/output/$PROBAND_GVCF" \ --output_gvcf_parent1="/output/$FATHER_GVCF" \ --output_gvcf_parent2="/output/$MOTHER_GVCF"
/mnt/common/Precision/GLNexus/glnexus_cli -c DeepVariantWGS \ $PROBAND_GVCF \ $FATHER_GVCF \ $MOTHER_GVCF \ --threads $NSLOTS \
bcftools view ${FAMILY_ID}.glnexus.merged.bcf | bgzip -c > ${FAMILY_ID}.glnexus.merged.vcf.gz
Normalize VCF
$BCFTOOLS norm \ -f $GENOME_FASTA \ --threads @NSLOTS \ -m - \ -O z \ --output $INVCF_NORM \ $INVCF
$TABIX -f $INVCF_NORM
Filter normalized VCF for high depth + low alt allele support
$BCFTOOLS filter \ --include 'FORMAT/AD[:1]>=5 && FORMAT/DP[] < 600' \ -m + \ -s + \ -O z \ --output $INVCF_NORM_FILTER \ $INVCF_NORM
$TABIX -f $INVCF_NORM_FILTER
zgrep -v "RefCall" $INVCF_NORM_FILTER | bgzip -c > $INVCF_NORM_FILTER_NOREFCALL tabix $INVCF_NORM_FILTER_NOREFCALL
$SLIVAR/slivar.exe expr --vcf $INVCF_NORM_FILTER_NOREFCALL \ --ped $PED \ --pass-only \ -g $SLIVAR/gnomad.hg38.genomes.v3.fix.zip \ --info "INFO.gnomad_popmax_af < $AF_CUTOFF && INFO.gnomad_nhomalt < $HOM_ALT_CUTOFF && variant.FILTER == \"PASS\" && variant.ALT[0] != \"*\"" \ --js $SLIVAR/slivar/js/slivar-functions.js \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE
vep \ -i $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE \ --buffer_size 10000 \ --polyphen b\ --fasta $GENOME_FASTA \ --sift b \ --biotype \ --hgvs \ --protein \ --domains \ --pubmed \ --ccds \ --check_existing \ --nearest symbol \ --gene_phenotype \ --canonical \ --force_overwrite \ --offline \ --cache \ --dir_cache $CACHEDIR \ --dir_plugins $PLUGINSDIR \ --plugin MaxEntScan,$MAXENTSCANDIR \ --plugin SpliceAI,snv=$SPLICEAISNV,indel=$SPLICEAIINDEL \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO \ --compress_output bgzip \ --fork 20 \ --vcf
bcftools +split-vep \ -p vep \ -a CSQ \ -O z \ -o $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT \ -c MaxEntScan_alt:Float,\ MaxEntScan_diff:Float,\ MaxEntScan_ref:Float,\ SpliceAI_pred_DP_AG:Float,\ SpliceAI_pred_DP_AL:Float,\ SpliceAI_pred_DP_DG:Float,\ SpliceAI_pred_DP_DL:Float,\ SpliceAI_pred_DS_AG:Float,\ SpliceAI_pred_DS_AL:Float,\ SpliceAI_pred_DS_DG:Float,\ SpliceAI_pred_DS_DL:Float,\ SpliceAI_pred_SYMBOL:String,\ Existing_variation:String,\ Protein_position:String,\ DOMAINS:String,\ PUBMED:String,\ CLIN_SIG:String,\ -s worst \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO
$VCFANNO -lua $ANNOTVARDIR/VCFAnno/custom.lua \ -p $NSLOTS -base-path /mnt/common/DATABASES/REFERENCES/GRCh38/ \ $ANNOTVARDIR/VCFAnno/VCFAnno_Config_20210222_GPCC_NoSpliceAI_GRCh38.toml \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT > $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO
sed -e 's/vepMaxEntScan_alt,Number=./vepMaxEntScan_alt,Number=1/g' \ -e 's/vepMaxEntScan_diff,Number=./vepMaxEntScan_diff,Number=1/g' \ -e 's/vepMaxEntScan_ref,Number=./vepMaxEntScan_ref,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AG,Number=./vepSpliceAI_pred_DP_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_AL,Number=./vepSpliceAI_pred_DP_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DG,Number=./vepSpliceAI_pred_DP_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DP_DL,Number=./vepSpliceAI_pred_DP_DL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AG,Number=./vepSpliceAI_pred_DS_AG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_AL,Number=./vepSpliceAI_pred_DS_AL,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DG,Number=./vepSpliceAI_pred_DS_DG,Number=1/g' \ -e 's/vepSpliceAI_pred_DS_DL,Number=./vepSpliceAI_pred_DS_DL,Number=1/g' \ $INVCF_NORM_FILTER_NOREFCALL_SLIVARRARE_VEPANNO_SPLIT_VCFANNO \
VCF2DB
Gemini queries across inheritance patterns (that's a lot of calls within a shell script, but nice because it can handle all inheritance patterns automagically).
Some finalizing scripts in python for basic table reshuffling, adding HPO/OMIM/MESHOPs tabs based on gene, and then an Excel macro to bring it to a close for a fancily formatted excel spreadsheet.