bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

failed prediction #90

Closed (gianfilippo closed this 3 years ago)

gianfilippo commented 3 years ago

Hi, a prediction call to somaticseq_parallel.py (20 threads) failed on a sample with the error below. Can you please help?

Traceback (most recent call last):
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 31, in runPaired_by_region
    run_somaticseq.runPaired(outdir_i, ref, tbam, nbam, tumor_name, normal_name, truth_snv, truth_indel, classifier_snv, classifier_indel, pass_threshold, lowqual_threshold, hom_threshold, het_threshold, dbsnp, cosmic, inclusion, exclusion, mutect, indelocator, mutect2, varscan_snv, varscan_indel, jsm, sniper, vardict, muse, lofreq_snv, lofreq_indel, scalpel, strelka_snv, strelka_indel, tnscope, platypus, min_mq, min_bq, min_caller, somaticseq_train, ensembleOutPrefix, consensusOutPrefix, classifiedOutPrefix, algo, keep_intermediates)
  File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/run_somaticseq.py", line 88, in runPaired
    tsv2vcf.tsv2vcf(classifiedSnvTsv, classifiedSnvVcf, snvCallers, pass_score=pass_threshold, lowqual_score=lowqual_threshold, hom_threshold=hom_threshold, het_threshold=het_threshold, single_mode=False, paired_mode=True, normal_sample_name=normal_name, tumor_sample_name=tumor_name, print_reject=True, phred_scaled=True)
  File "$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/somaticseq/SSeq_tsv2vcf.py", line 110, in tsv2vcf
    with open(tsv_fn) as tsv, open(vcf_fn, 'w') as vcf:
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "$HOME/.local/bin/somaticseq_parallel.py", line 4, in <module>
    __import__('pkg_resources').run_script('SomaticSeq==3.3.0', 'somaticseq_parallel.py')
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/gpfs/ycga/$HOME/.local/lib/python3.7/site-packages/SomaticSeq-3.3.0-py3.7.egg/EGG-INFO/scripts/somaticseq_parallel.py", line 113, in <module>
    subdirs = pool.map(runPaired_by_region_i, bed_splitted)
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/ycga-gpfs/apps/hpc/software/Python/3.7.0-foss-2018b/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '$HOME/test/Sample/18/SSeq.Classified.sSNV.tsv'
slurmstepd: error: Detected 3 oom-kill event(s) in step 9794252.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

litaifang commented 3 years ago

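It's complaining about memory: the slurmstepd line at the end of your log reports 3 oom-kill events, so some processes were probably killed before SSeq.Classified.sSNV.tsv could be written.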
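A quick way to check a job log for this before chasing the missing file is to search for the scheduler's oom-kill message. This is only a minimal sketch: the pattern is modeled on the slurmstepd line quoted in the report and may not match other SLURM versions, and `count_oom_kills` is a hypothetical helper, not part of SomaticSeq.

```python
import re

# Pattern modeled on the slurmstepd line quoted in the report above.
# Assumption: other schedulers or SLURM versions may word this differently.
OOM_PATTERN = re.compile(r"Detected (\d+) oom-kill event\(s\)")


def count_oom_kills(log_text: str) -> int:
    """Return the total number of oom-kill events reported in a job log."""
    return sum(int(n) for n in OOM_PATTERN.findall(log_text))


if __name__ == "__main__":
    sample_log = (
        "FileNotFoundError: [Errno 2] No such file or directory\n"
        "slurmstepd: error: Detected 3 oom-kill event(s) in step 9794252.batch cgroup.\n"
    )
    # A non-zero count suggests the missing file is a symptom, not the cause.
    print(count_oom_kills(sample_log))
```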

Just a thought: you may try the xgboost algorithm in the latest version. XGBoost is orders of magnitude faster and uses less memory than the original AdaBoost implemented in R.

So if you install the latest SomaticSeq, you can train an xgboost model from your labeled TSV files (yes, it can take multiple TSV files and will combine them automatically):

    somatic_xgboost.py train -tsvs sample_1/Ensemble.sSNV.tsv sample_2/Ensemble.sSNV.tsv ... -out SNV.xgboost.classifier -threads 14

To use that classifier on new data:

    somatic_xgboost.py predict -tsv Ensemble.sSNV.tsv -model SNV.xgboost.classifier -out Predicted.sSNV.tsv

To see different CLI options: somatic_xgboost.py train -h or somatic_xgboost.py predict -h.

Then, if you want to convert that into a VCF file:

    SSeq_tsv2vcf.py -tsv Predicted.sSNV.tsv -vcf Predicted.sSNV.vcf -tools [FILL IN ALL THE TOOLS YOU USED] -all -phred -paired
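If you would rather script the three steps (train, predict, convert to VCF), they could be assembled roughly like this. This is a sketch only: it uses just the flags shown in the commands in this comment, `xgboost_workflow` is a hypothetical helper, and passing the tool names as separate arguments after -tools is an assumption, so check SSeq_tsv2vcf.py -h for the accepted names and format.

```python
from typing import List


def xgboost_workflow(train_tsvs: List[str], new_tsv: str, tools: List[str],
                     model: str = "SNV.xgboost.classifier",
                     predicted_tsv: str = "Predicted.sSNV.tsv",
                     predicted_vcf: str = "Predicted.sSNV.vcf",
                     threads: int = 14) -> List[List[str]]:
    """Build the train, predict, and TSV-to-VCF commands as argument lists."""
    if not train_tsvs:
        raise ValueError("at least one labeled TSV is needed for training")
    if not tools:
        # The tool list is deliberately not guessed: it must match the
        # callers that produced the ensemble TSV.
        raise ValueError("tools must list every caller used upstream")

    train = ["somatic_xgboost.py", "train", "-tsvs", *train_tsvs,
             "-out", model, "-threads", str(threads)]
    predict = ["somatic_xgboost.py", "predict", "-tsv", new_tsv,
               "-model", model, "-out", predicted_tsv]
    # Assumption: tool names follow -tools as separate arguments.
    to_vcf = ["SSeq_tsv2vcf.py", "-tsv", predicted_tsv, "-vcf", predicted_vcf,
              "-tools", *tools, "-all", "-phred", "-paired"]
    return [train, predict, to_vcf]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in order, so a failed training step stops the run before prediction starts.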

litaifang commented 3 years ago

For xgboost training, you may also pass any of the xgboost training parameters (as described here) to the program in this format: --extra-params max_leaves:8 max_bin:16, etc.
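To make the key:value format concrete, here is a small sketch of how such tokens could be split into a parameter dictionary. It illustrates the format only and is not SomaticSeq's actual parsing code; `parse_extra_params` is a hypothetical name, and values are kept as strings because how the program casts them is not confirmed here.

```python
from typing import Dict, List


def parse_extra_params(tokens: List[str]) -> Dict[str, str]:
    """Split key:value tokens such as 'max_leaves:8' into a dict of strings."""
    params: Dict[str, str] = {}
    for token in tokens:
        # partition splits on the first colon only
        key, sep, value = token.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {token!r}")
        params[key] = value
    return params


if __name__ == "__main__":
    print(parse_extra_params(["max_leaves:8", "max_bin:16"]))
```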

gianfilippo commented 3 years ago

Hi, thanks for the suggestion. For now I increased the memory allocation and the first failed sample completed. Eventually, I will try xgboost as well. Thanks
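For anyone hitting the same error, one way to see which parallel regions did not finish is to list the numbered subdirectories that lack the classified TSV. This is a sketch under an assumption: the layout (one numbered subdirectory per region, each holding SSeq.Classified.sSNV.tsv) is inferred from the path in the error message, and `missing_region_outputs` is a hypothetical helper, not part of SomaticSeq.

```python
from pathlib import Path
from typing import List


def missing_region_outputs(outdir: str,
                           filename: str = "SSeq.Classified.sSNV.tsv") -> List[str]:
    """List numbered subdirectories of outdir that lack the expected file."""
    root = Path(outdir)
    # Assumption: per-region work directories are named 1, 2, 3, ...
    numbered = sorted(
        (d for d in root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    return [d.name for d in numbered if not (d / filename).is_file()]
```

An empty list means every numbered region produced its output; any names returned are the regions worth rerunning with more memory.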