bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

somatic_xgboost.py train: xgboost.core.XGBoostError #109

Closed: iserf closed this issue 2 years ago

iserf commented 2 years ago

Hi,

I am using SomaticSeq from the dockerized version: lethalfang/somaticseq:latest

Currently, I am trying to train a classifier from several TSV files that were previously created by somaticseq_parallel.py. An error occurred in the XGBoost core:

    xgboost.core.XGBoostError: [12:57:49] /workspace/src/objective/regression_obj.cu:106: label must be in [0,1] for logistic regression

I ran the following commands:

  1. Create TSV files for three samples (274042, 274044, 274046):

         somaticseq_parallel.py \
           --output-directory $OUT_DIR \
           --genome-reference $REFERENCE/Homo_sapiens_assembly38.fasta \
           --inclusion-region $BED \
           --algorithm xgboost \
           --threads 24 \
           paired \
           --tumor-bam-file $BAM_DIR/$BAM_T \
           --normal-bam-file $BAM_N \
           --tumor-sample $ID_T \
           --normal-sample $ID_N \
           --mutect2-vcf $mutect2 \
           --vardict-vcf $vardict \
           --varscan-snv $varscan_snv \
           --varscan-indel $varscan_indel \
           --strelka-snv $strelka_snv \
           --strelka-indel $strelka_indel \
           --arbitrary-snvs $octopus_snv \
           --arbitrary-indels $octopus_indel

     This step runs without any problems.

  2. Train a classifier:

         somatic_xgboost.py train \
           -tsvs $tsvs/274042/Ensemble.sSNV.tsv $tsvs/274044/Ensemble.sSNV.tsv $tsvs/274046/Ensemble.sSNV.tsv \
           -out $OUT_DIR/multiSample.SNV.classifier \
           -threads 8 -depth 12 -seed 42 -method hist -iter 250 \
           --extra-params grow_policy:lossguide max_leaves:24

Samples 274042, 274044, and 274046 are reference materials that come with a VCF file of true-positive calls, which I used as truth sets. These true positives are similar across all samples but occur at different VAFs.
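Since the error message complains about labels outside [0,1], one thing that could be checked is what the label column of each Ensemble.sSNV.tsv actually contains. The following is only a minimal sketch: it assumes the label column is named `TrueVariant_or_False` (an assumption; adjust it if the TSV header differs), and the function names are made up for illustration and are not part of SomaticSeq.

```python
import csv
import math
import sys
from collections import Counter

# Assumed name of the label column in the SomaticSeq TSV; change if the header differs.
LABEL_COLUMN = "TrueVariant_or_False"


def summarize_labels(tsv_path, label_column=LABEL_COLUMN):
    """Count how often each raw value occurs in the label column of a TSV file."""
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if label_column not in (reader.fieldnames or []):
            raise KeyError(f"column {label_column!r} not found in {tsv_path}")
        for row in reader:
            counts[row[label_column]] += 1
    return counts


def is_valid_binary_label(value):
    """Return True if the value parses as a number within [0, 1] (NaN and text fail)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return not math.isnan(number) and 0.0 <= number <= 1.0


def invalid_label_counts(tsv_path, label_column=LABEL_COLUMN):
    """Return only the label values that a logistic objective would reject."""
    counts = summarize_labels(tsv_path, label_column)
    return {value: n for value, n in counts.items() if not is_valid_binary_label(value)}


if __name__ == "__main__":
    # Usage: python check_labels.py <file1.tsv> <file2.tsv> ...
    for path in sys.argv[1:]:
        print(path, dict(summarize_labels(path)), "invalid:", invalid_label_counts(path))
```

If any file reports invalid values (for example missing or non-numeric labels), that would be consistent with the XGBoost error above, and it would suggest the truth set did not end up in that TSV.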

Does anybody see the error in my commands that triggers the XGBoost crash?

Thanks a lot in advance!

Best wishes,

Flo