bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Error when running makeSomaticScripts with multiple threads #125

Closed by GACGAMA 11 months ago

GACGAMA commented 12 months ago

Hi! I'm running SomaticSeq with the following command:

makeSomaticScripts.py single --bam /scratch/bams/a.bam --genome-reference /scratch/references/GRCh38_full_analysis_set_plus_decoy_hla.fa --output-directory /scratch/somaticseq/test/a/ --dbsnp-vcf /scratch/references/Homo_sapiens_assembly38.dbsnp138.vcf.gz --container-tech singularity --threads 2 --run-mutect2 --run-vardict --run-lofreq --run-scalpel --run-strelka2 --run-somaticseq --run-workflow

And I'm getting:

INFO 2023-07-10 23:38:33,282 SomaticSeq           SomaticSeq Input Arguments: output_directory=/e73c3c5d6c32413eb115298282cbe0ca/1/SomaticSeq, genome_reference=/3e9c398aa338497f81decea731581d91/GRCh38_full_analysis_set_plus_decoy_hla.fa, truth_snv=None, truth_indel=None, classifier_snv=None, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, algorithm=xgboost, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/5ba717967bfc4dbeb6099c48e915fd19/Homo_sapiens_assembly38.dbsnp138.vcf.gz, cosmic_vcf=None, inclusion_region=/aa432f4ba4d84df9b56cf51ee98a3f3f/1.bed, exclusion_region=None, threads=1, somaticseq_train=False, seed=0, tree_depth=12, iterations=None, features_excluded=[], extra_hyperparameters=None, keep_intermediates=False, bam_file=/41fc4366b0ff4de2af93cc6d57191a55/511869-0424960671.bam, sample_name=TUMOR, mutect_vcf=None, mutect2_vcf=/e73c3c5d6c32413eb115298282cbe0ca/1/MuTect2.vcf, varscan_vcf=None, vardict_vcf=/e73c3c5d6c32413eb115298282cbe0ca/1/VarDict.vcf, lofreq_vcf=/e73c3c5d6c32413eb115298282cbe0ca/1/LoFreq.vcf, scalpel_vcf=/e73c3c5d6c32413eb115298282cbe0ca/1/Scalpel.vcf, strelka_vcf=/e73c3c5d6c32413eb115298282cbe0ca/1/Strelka/results/variants/variants.vcf.gz, arbitrary_snvs=[], arbitrary_indels=[], which=single
INFO 2023-07-10 23:38:33,283 SomaticSeq           SomaticSeq Input Arguments: output_directory=/4e9a3c756da440a8bb8ce5265fe88203/2/SomaticSeq, genome_reference=/36c63d7e2b484e8094e8382881a2dada/GRCh38_full_analysis_set_plus_decoy_hla.fa, truth_snv=None, truth_indel=None, classifier_snv=None, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, algorithm=xgboost, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/c2e3cbff71d74c4cadddfec1a9cca095/Homo_sapiens_assembly38.dbsnp138.vcf.gz, cosmic_vcf=None, inclusion_region=/b2015e83abae40629e4084950925332c/2.bed, exclusion_region=None, threads=1, somaticseq_train=False, seed=0, tree_depth=12, iterations=None, features_excluded=[], extra_hyperparameters=None, keep_intermediates=False, bam_file=/c3fc9296c4ec4aac820aea8cb8d01a31/511869-0424960671.bam, sample_name=TUMOR, mutect_vcf=None, mutect2_vcf=/4e9a3c756da440a8bb8ce5265fe88203/2/MuTect2.vcf, varscan_vcf=None, vardict_vcf=/4e9a3c756da440a8bb8ce5265fe88203/2/VarDict.vcf, lofreq_vcf=/4e9a3c756da440a8bb8ce5265fe88203/2/LoFreq.vcf, scalpel_vcf=/4e9a3c756da440a8bb8ce5265fe88203/2/Scalpel.vcf, strelka_vcf=/4e9a3c756da440a8bb8ce5265fe88203/2/Strelka/results/variants/variants.vcf.gz, arbitrary_snvs=[], arbitrary_indels=[], which=single
Error: Unable to open file /4e9a3c756da440a8bb8ce5265fe88203/2/MuTect2.vcf. Exiting.

and also:

INFO 2023-07-10 23:49:58,869 run_script bash /scratch/somaticseq/test/a/logs/mergeResults.2023.07.10.13.39.19.746.cmd
Start at 2023/07/10 23:49:58
INFO: Using cached SIF image
Traceback (most recent call last):
  File "/usr/local/bin/concat.py", line 201, in <module>
    vcf(args.input_files, args.output_file, args.bgzip_output)
  File "/usr/local/bin/concat.py", line 28, in vcf
    with genome.open_textfile(file_i) as vcfin:
  File "/usr/local/lib/python3.10/dist-packages/somaticseq/genomicFileHandler/genomic_file_handlers.py", line 173, in open_textfile
    return open(file_name)
FileNotFoundError: [Errno 2] No such file or directory: '/5c34180f950d41bfaa95dfc8416c8199/a/2/MuTect2.vcf'
INFO 2023-07-10 23:49:59,687 run_script FINISHED RUNNING /scratch/ggama1/somaticseq/test/511869-0424960671/logs/mergeResults.2023.07.10.13.39.19.746.cmd in 0.818 seconds with an exit code of 1.

The workflow does not produce the merged consensus VCF, although I can see the per-thread consensus VCFs in /1/SomaticSeq and /2/SomaticSeq.

It seems that MuTect2 is not working, yet the --run-mutect2 step exits with a code of 0.
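Because the merge step only fails once a per-thread VCF is missing, it can help to check every thread directory up front. The following is a minimal sketch, not part of SomaticSeq: it assumes the numbered sub-directory layout and the caller file names seen in the log above, and `find_missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# File names copied from the SomaticSeq log above. The layout
# <output-directory>/1, <output-directory>/2, ... is inferred from that log.
EXPECTED = (
    "MuTect2.vcf",
    "VarDict.vcf",
    "LoFreq.vcf",
    "Scalpel.vcf",
    "Strelka/results/variants/variants.vcf.gz",
)


def find_missing_outputs(output_dir, expected=EXPECTED):
    """Map each numbered sub-directory to the caller outputs it lacks.

    A file counts as missing if it does not exist or is empty.
    Sub-directories with nothing missing are left out of the result.
    """
    missing = {}
    subdirs = sorted(
        (p for p in Path(output_dir).iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    for sub in subdirs:
        absent = [
            name
            for name in expected
            if not (sub / name).is_file() or (sub / name).stat().st_size == 0
        ]
        if absent:
            missing[sub.name] = absent
    return missing
```

Calling `find_missing_outputs("/scratch/somaticseq/test/a")` would then list, per thread directory, which callers produced no output, even when the wrapper script itself exited with 0.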

I found the problem. MuTect2 is memory-hungry, and not all threads finish successfully unless the Java memory setting is raised. Is there any plan to let makeSomaticScripts.py pass a Java memory argument to MuTect2? I'm thinking of modifying the MuTect2 script generator to include this change.

litaifang commented 12 months ago

Try going into one of the MuTect2 directories and executing the mutect2 .cmd file directly, then see what message you get.
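For several thread directories, the same check can be done one script at a time so that each script's messages land in its own log instead of being interleaved. This is only a sketch under assumptions: `run_serially` is a hypothetical helper, and the location and naming of the per-thread .cmd files must be supplied by the caller, since the thread does not spell them out.

```python
import subprocess
from pathlib import Path


def run_serially(scripts, interpreter=("bash",)):
    """Run each script one after another, never in parallel.

    stdout and stderr of every script are written to '<script>.serial.log'
    next to the script, so messages from different scripts cannot mix.
    Returns a list of (script_path, exit_code, log_path) tuples.
    """
    results = []
    for script in map(Path, scripts):
        log_path = script.with_name(script.name + ".serial.log")
        with open(log_path, "w") as log:
            proc = subprocess.run(
                [*interpreter, str(script)],
                stdout=log,
                stderr=subprocess.STDOUT,
            )
        results.append((script, proc.returncode, log_path))
    return results
```

A caller might collect the scripts with something like `sorted(Path(outdir).glob("*/logs/*.cmd"))`, where that glob pattern is a guess at the layout, and then read the log of any script whose output VCF is missing, whatever its exit code.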

GACGAMA commented 11 months ago

GATK was emitting memory errors that were hidden by the parallel output. After running each sample single-threaded, I identified the error! Increasing the Java memory in MuTect2.py from 8 GB to 20 GB did the job. I would suggest adding a simple memory argument alongside --run-mutect2!
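The arithmetic behind this fix is worth making explicit: each parallel MuTect2 job starts its own JVM, so the heap size multiplies with the thread count. Below is a rough sketch with hypothetical helper names. The `-Xmx` flag is the standard JVM maximum-heap option, while the 2 GB per-job overhead is an assumed figure, and how the generated script passes the flag to GATK is not shown in this thread.

```python
def java_heap_flag(heap_gb):
    """Format a JVM maximum-heap flag from a whole number of gigabytes."""
    if heap_gb <= 0:
        raise ValueError("heap size must be positive")
    return f"-Xmx{int(heap_gb)}g"


def max_parallel_jobs(total_mem_gb, heap_gb, overhead_gb=2):
    """Estimate how many JVMs with the given heap fit into total_mem_gb.

    overhead_gb is an assumed allowance for non-heap memory per job;
    it is not a value taken from SomaticSeq or GATK.
    """
    per_job = heap_gb + overhead_gb
    return max(0, int(total_mem_gb // per_job))
```

With these assumptions, `java_heap_flag(20)` gives `-Xmx20g`, and `max_parallel_jobs(64, 20)` suggests that a 64 GB node fits two such jobs, which matches running with --threads 2.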

litaifang commented 11 months ago

Yeah, I'll try to do that in the next iteration.