bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
192 stars 53 forks source link

makeSomaticScripts.py "single" generated command incorrectly passes vcf.gz file to option -d of lofreq call. #134

Closed harrymatthews50 closed 3 months ago

harrymatthews50 commented 3 months ago

Thank you for the amazing toolbox! I have noted something odd in the scripts generated to run lofreq call by makeSomaticScrtipts.py

Running 'makeSomaticScripts.py single:

#!/usr/bin/bash
home_dir=/home/h600s/somaticseq_bug_report
docker run --rm -v ${home_dir}:${home_dir} -v /mounts/csf-cfdna-snakemake/:/mounts/csf-cfdna-snakemake/ -u $UID --memory 200G lethalfang/somaticseq:latest /bin/bash -c \
"makeSomaticScripts.py single \
--genome-reference /mounts/csf-cfdna-snakemake/resources/hg38/v0/Homo_sapiens_assembly38.fasta   \
--inclusion-region /mounts/csf-cfdna-snakemake/resources/bed/NPHD2022A_3383431_Covered_adaptedChrom_padded_2_hg38.bed \
--bam /mounts/csf-cfdna-snakemake/workflows/preprocessing/results/L1283/DEDUPED_BAM_1/345705.aligned.dedup.bam \
--sample-name 345705 \
--output-directory /home/h600s/somaticseq_bug_report/ \
--dbsnp-vcf /mounts/csf-cfdna-snakemake/resources/funcotator_dat_source_in_use/dbsnp/hg38/hg38_All_20180418.vcf.gz \
--minimum-VAF 0.0025 \
--run-mutect2 --run-varscan2 --run-lofreq --run-vardict --run-scalpel --run-strelka2 --exome-setting \
--run-somaticseq \
--run-workflow --by-caller"
wait

Generates the following .cmd file to run lofreq call

#!/bin/bash

#!/bin/bash

#$ -o /home/h600s/somaticseq_bug_report/logs
#$ -e /home/h600s/somaticseq_bug_report/logs
#$ -S /bin/bash
#$ -l h_vmem=12G
set -e

echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2

docker run  -v /mounts/csf-cfdna-snakemake/workflows/preprocessing/results/L1283/DEDUPED_BAM_1:/e26d27134e534bd99c97b44789204584 -v /mounts/csf-cfdna-snakemake/resources/hg38/v0:/4709774e20f54b52b583728f26ffdc7b -v /home/h600s:/0f1242a6724f47e0983adc367a3822d3 -v /mounts/csf-cfdna-snakemake/resources/bed:/4d4ed19d68334f02bf2310e16982e915 -v /mounts/csf-cfdna-snakemake/resources/funcotator_dat_source_in_use/dbsnp/hg38:/f09e876e2db840efbdfd2c775bf86bad -u $(id -u):$(id -g) --rm  lethalfang/lofreq:2.1.3.1-1 \
lofreq call \
--call-indels \
-l /4d4ed19d68334f02bf2310e16982e915/NPHD2022A_3383431_Covered_adaptedChrom_padded_2_hg38.bed \
-f /4709774e20f54b52b583728f26ffdc7b/Homo_sapiens_assembly38.fasta \
-o /0f1242a6724f47e0983adc367a3822d3/somaticseq_bug_report/LoFreq.vcf \
-d /f09e876e2db840efbdfd2c775bf86bad/hg38_All_20180418.vcf.gz \
/e26d27134e534bd99c97b44789204584/345705.aligned.dedup.bam

echo -e "Done at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2

Passing a vcf.gz file of germline variants to -d makes sense for lofreq somatic, but the option -d is supposed to specify the maximum depth for lofreq call. See documentation below.

Usage: lofreq call [options] in.bam Options:

  • Reference: -f | --ref FILE Indexed reference fasta file (gzip supported) [null]
  • Output: -o | --out FILE Vcf output file [- = stdout]
  • Regions: -r | --region STR Limit calls to this region (chrom:start-end) [null] -l | --bed FILE List of positions (chr pos) or regions (BED) [null]
  • Base-call quality: -q | --min-bq INT Skip any base with baseQ smaller than INT [6] -Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6] -R | --def-alt-bq INT Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0] -j | --min-jq INT Skip any base with joinedQ smaller than INT [0] -J | --min-alt-jq INT Skip alternate bases with joinedQ smaller than INT [0] -K | --def-alt-jq INT Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]
  • Base-alignment (BAQ) and indel-aligment (IDAQ) qualities: -B | --no-baq Disable use of base-alignment quality (BAQ) -A | --no-idaq Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging) -D | --del-baq Delete pre-existing BAQ values, i.e. compute even if already present in BAM -e | --no-ext-baq Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)
  • Mapping quality: -m | --min-mq INT Skip reads with mapping quality smaller than INT [0] -M | --max-mq INT Cap mapping quality at INT [255] -N | --no-mq Don't merge mapping quality in LoFreq's model
  • Indels: --call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!) --only-indels Only call indels; no SNVs
  • Source quality: -s | --src-qual Enable computation of source quality -S | --ign-vcf FILE Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas -T | --def-nm-q INT If >= 0, then replace non-match base qualities with this default value [-1]
  • P-values: -a | --sig P-Value cutoff / significance level [0.010000] -b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
  • Misc.: -C | --min-cov INT Test only positions having at least this coverage [1] (note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done) -d | --max-depth INT Cap coverage at this depth [1000000] --illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded --use-orphan Count anomalous read pairs (i.e. where mate is not aligned properly) --plp-summary-only No variant calling. Just output pileup summary per column --no-default-filter Don't run default 'lofreq filter' automatically after calling variants --force-overwrite Overwrite any existing output --verbose Be verbose --debug Enable debugging

The command executes without errors, but it raises the question of how/if germline variants are being screened when using the lofreq caller for tumour-only samples.

litaifang commented 3 months ago

Thanks for pointing that out. I'll look into it in the next few days and then fix that.

litaifang commented 3 months ago

I removed -d *.vcf.gz from LoFreq's single sample mode in the latest master branch. Will soon release a version with that.