bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

AI consensus calling error on WGS samples #131

Open GACGAMA opened 5 months ago

GACGAMA commented 5 months ago

I'm trying to run somaticseq_parallel.py on per-sample VCFs to produce the AI consensus calls. I'm using SomaticSeq v3.7.3 with XGBoost 2.0.2. I ran all the mutation callers and then, with the resulting VCF files, ran the following command:

somaticseq_parallel.py \
  --classifier-snv /scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/SNV_model.classifier \
  --classifier-indel /scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/INDEL_model.classifier \
  --output-directory /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/consensus_AI/kids_first/BH12847_1_TUMOR \
  --genome-reference /scratch4/nsobrei2/references/ncbi_grch38_cipher/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -dbsnp /scratch4/nsobrei2/references/dbsnp/138_cipher/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  --threads 38 \
  paired \
  --tumor-bam-file /scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_TUMOR.bam \
  --normal-bam-file /scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_GERMLINE.bam \
  --mutect2-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuTect2.vcf.gz \
  --vardict-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarDict.vcf.gz \
  --somaticsniper-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.SomaticSniper.vcf.gz \
  --muse-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuSE.vcf.gz \
  --strelka-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.snv.vcf.gz \
  --strelka-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.indel.vcf.gz \
  --varscan-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.snv.vcf.gz \
  --varscan-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.indel.vcf.gz \
  --lofreq-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.snv.vcf.gz \
  --lofreq-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.indel.vcf.gz

This is the output, including the error:


INFO 2024-01-29 21:25:59,514 SomaticSeq           SomaticSeq Input Arguments: output_directory=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/consensus_AI/kids_first/BH12847_1_TUMOR, genome_reference=/scratch4/nsobrei2/references/ncbi_grch38_cipher/GRCh38_full_analysis_set_plus_decoy_hla.fa, truth_snv=None, truth_indel=None, classifier_snv=/scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/SNV_model.classifier, classifier_indel=/scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/INDEL_model.classifier, pass_threshold=0.5, lowqual_threshold=0.1, algorithm=xgboost, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/scratch4/nsobrei2/references/dbsnp/138_cipher/Homo_sapiens_assembly38.dbsnp138.vcf.gz, cosmic_vcf=None, inclusion_region=None, exclusion_region=None, threads=38, somaticseq_train=False, seed=0, tree_depth=12, iterations=None, features_excluded=[], extra_hyperparameters=None, keep_intermediates=False, tumor_bam_file=/scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_TUMOR.bam, normal_bam_file=/scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_GERMLINE.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuTect2.vcf.gz, varscan_snv=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.snv.vcf.gz, varscan_indel=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.indel.vcf.gz, jsm_vcf=None, somaticsniper_vcf=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.SomaticSniper.vcf.gz, 
vardict_vcf=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarDict.vcf.gz, muse_vcf=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuSE.vcf.gz, lofreq_snv=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.snv.vcf.gz, lofreq_indel=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.indel.vcf.gz, scalpel_vcf=None, strelka_snv=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.snv.vcf.gz, strelka_indel=/scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.indel.vcf.gz, tnscope_vcf=None, platypus_vcf=None, arbitrary_snvs=[], arbitrary_indels=[], which=paired
***** WARNING: File /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/consensus_AI/kids_first/BH12847_1_TUMOR/38.th.input.bed has inconsistent naming convention for record:
HLA-A*01:01:01:01   0   3503

2024-01-29 21:29:24,802 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:29:24,802 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:29:43,208 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:29:43,208 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:29:55,957 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:29:55,957 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:29:57,641 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:29:57,641 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:00,880 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:00,880 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:03,324 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:03,324 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:05,665 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:05,665 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:05,670 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:05,670 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:06,451 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:06,451 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:07,968 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:07,968 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:08,179 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:08,179 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:08,784 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:08,784 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:17,032 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:17,032 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:17,879 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:17,879 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:17,993 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:17,993 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:18,751 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:18,751 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:23,687 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:23,687 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:24,247 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:24,247 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:24,306 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:24,306 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:25,604 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:25,604 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:26,489 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:26,489 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:26,632 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:26,632 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:27,644 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:27,644 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:27,884 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:27,884 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:28,425 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:28,425 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:28,616 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:28,616 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:29,069 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:29,069 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:29,767 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:29,767 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:30,179 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:30,179 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:30,292 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:30,292 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:30,705 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:30,705 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:30,930 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:30,930 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:31,435 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:31,435 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:31,742 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:31,742 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:31,956 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:31,956 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:32,202 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:32,202 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:34,058 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:34,058 somatic_vcf2tsv.py   NO RE-SCALING
2024-01-29 21:30:36,062 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
INFO 2024-01-29 21:30:36,062 somatic_vcf2tsv.py   NO RE-SCALING
INFO 2024-01-29 22:26:09,775 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:26:09,775 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:43:11,993 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:43:11,993 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:44:54,332 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:44:54,332 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:53:46,696 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:53:46,696 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:57:05,534 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:57:05,534 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:57:38,264 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:57:38,264 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 22:58:31,952 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 22:58:31,952 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:00:51,089 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:00:51,089 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:03:32,194 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:03:32,194 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:04:09,206 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:04:09,206 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:07:29,075 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:07:29,075 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:08:31,220 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:08:31,220 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:08:58,106 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:08:58,106 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:09:43,311 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:09:43,311 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:10:37,123 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:10:37,123 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:11:15,132 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:11:15,132 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:14:03,066 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:14:03,066 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:15:32,899 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:15:32,900 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:18:17,799 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:18:17,799 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:18:40,118 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:18:40,118 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:19:20,634 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:19:20,634 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:21:47,766 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:21:47,766 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:30:41,076 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:30:41,076 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:31:09,867 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:31:09,868 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:31:36,892 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:31:36,892 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:32:03,360 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:32:03,361 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:33:42,153 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:33:42,153 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:34:06,125 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:34:06,126 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:45:14,909 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:45:14,909 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-29 23:54:18,994 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-29 23:54:18,994 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:02:30,329 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:02:30,329 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:02:43,281 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:02:43,281 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:04:03,375 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:04:03,375 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:07:53,272 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:07:53,272 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:20:54,193 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:20:54,193 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:26:38,802 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:26:38,802 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:29:20,286 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:29:20,286 xgboost_predictor    Number of trees to use = 100
INFO 2024-01-30 00:38:09,574 xgboost_predictor    Columns removed for prediction: CHROM,POS,ID,REF,ALT,Strelka_QSS,Strelka_TQSS,if_COSMIC,COSMIC_CNT,TrueVariant_or_False
INFO 2024-01-30 00:38:09,575 xgboost_predictor    Number of trees to use = 100
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 84, in runPaired_by_region
    run_somaticseq.runPaired(
  File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 169, in runPaired
    modelPredictor(ensembleSnv, classifiedSnvTsv, algo, classifier_snv, iterations=iterations, features_to_exclude=features_excluded)
  File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 87, in modelPredictor
    somatic_xgboost.predictor(classifier, input_file, output_file, non_features, iterations)
  File "/home/ggama1/programs/somaticseq/somaticseq/somatic_xgboost.py", line 173, in predictor
    scores = xgb_model.predict(dtest, ntree_limit=iterations)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Booster.predict() got an unexpected keyword argument 'ntree_limit'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/bin/somaticseq_parallel.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 308, in <module>
    subdirs = pool.map(runPaired_by_region_i, bed_splitted)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
TypeError: Booster.predict() got an unexpected keyword argument 'ntree_limit'
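
For context on this error: `ntree_limit` was deprecated in XGBoost 1.4 in favour of `iteration_range`, and it was removed from `Booster.predict()` in XGBoost 2.0, which is why the call at `somatic_xgboost.py` line 173 fails under 2.0.2. Below is a minimal sketch of a version-tolerant call. It assumes one tree per boosting round, as is usual for a single-output `binary:logistic` model. `limit_trees_kwargs` and the two stand-in functions are hypothetical names used only for illustration; they are not part of SomaticSeq or XGBoost.

```python
import inspect


def limit_trees_kwargs(predict_fn, iterations):
    """Build keyword arguments restricting prediction to the first `iterations` rounds.

    Hypothetical helper: it inspects the signature of the given predict
    callable, so the same call site works on releases with or without
    the `ntree_limit` argument.
    """
    if not iterations:
        return {}  # no limit requested: use every tree in the model
    params = inspect.signature(predict_fn).parameters
    if "iteration_range" in params:
        # Half-open range [0, iterations) of boosting rounds.
        return {"iteration_range": (0, iterations)}
    return {"ntree_limit": iterations}


# Stand-ins that mimic the two predict() signatures,
# so the sketch runs without XGBoost installed.
def predict_new_style(data, iteration_range=(0, 0)):
    return iteration_range


def predict_old_style(data, ntree_limit=0):
    return ntree_limit


print(limit_trees_kwargs(predict_new_style, 100))
print(limit_trees_kwargs(predict_old_style, 100))
```

With a real model, the failing line could then be written along the lines of `scores = xgb_model.predict(dtest, **limit_trees_kwargs(xgb_model.predict, iterations))`. Whether SomaticSeq should adopt this or simply pin an XGBoost version is a separate decision for the maintainers.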

The log from training the AI model used in the command above was:


INFO 2024-01-27 08:52:51,190 xgboost_builder      Columns removed before training: CHROM, POS, ID, REF, ALT, Strelka_QSS, Strelka_TQSS, if_COSMIC, COSMIC_CNT, TrueVariant_or_False
INFO 2024-01-27 08:52:51,190 xgboost_builder      Number of boosting rounds = 1000
INFO 2024-01-27 08:52:51,191 xgboost_builder      Hyperparameters: max_depth=8, nthread=48, objective=binary:logistic, seed=0, tree_method=hist, grow_policy=lossguide
/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py:160: UserWarning: [09:07:04] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
  warnings.warn(smsg, UserWarning)

GACGAMA commented 5 months ago

With older versions of XGBoost (1.7.1 and 1.6) I get a different error, which looks more like a bug:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 84, in runPaired_by_region
    run_somaticseq.runPaired(
  File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 169, in runPaired
    modelPredictor(ensembleSnv, classifiedSnvTsv, algo, classifier_snv, iterations=iterations, features_to_exclude=features_excluded)
  File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 87, in modelPredictor
    somatic_xgboost.predictor(classifier, input_file, output_file, non_features, iterations)
  File "/home/ggama1/programs/somaticseq/somaticseq/somatic_xgboost.py", line 172, in predictor
    dtest = xgb.DMatrix(test_data)
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 532, in inner_f
    return f(**kwargs)
           ^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 643, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
                                           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 896, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 348, in _from_pandas_df
    return _from_numpy_array(data, missing, nthread, feature_names, feature_types)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 184, in _from_numpy_array
    _check_call(
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 203, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`
Stack trace:
  [bt] (0) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x154c79) [0x1554604bfc79]
  [bt] (1) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x179e1d) [0x1554604e4e1d]
  [bt] (2) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1aaeea) [0x155460515eea]
  [bt] (3) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x16a2d5) [0x1554604d52d5]
  [bt] (4) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromDense+0x453) [0x15546041fbb3]
  [bt] (5) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x15554f7e7052]
  [bt] (6) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x15554f7e5925]
  [bt] (7) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x15554f7e606e]
  [bt] (8) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e4) [0x15554f7f72e4]

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/bin/somaticseq_parallel.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 308, in <module>
    subdirs = pool.map(runPaired_by_region_i, bed_splitted)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`
Stack trace:
  [bt] (0) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x154c79) [0x1554604bfc79]
  [bt] (1) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x179e1d) [0x1554604e4e1d]
  [bt] (2) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1aaeea) [0x155460515eea]
  [bt] (3) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x16a2d5) [0x1554604d52d5]
  [bt] (4) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromDense+0x453) [0x15546041fbb3]
  [bt] (5) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x15554f7e7052]
  [bt] (6) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x15554f7e5925]
  [bt] (7) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x15554f7e606e]
  [bt] (8) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e4) [0x15554f7f72e4]

This seems to be caused by:

xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`

This only happens when applying the model to high read depth WGS samples. WES samples work fine.

litaifang commented 5 months ago

Yeah thanks for the report. ntree_limit has been deprecated in xgboost. Let me figure out what's the best way forward.

GACGAMA commented 5 months ago

I've managed to make it work with xgboost 2.0.3 by changing somatic_xgboost.py, starting at line 163:

    for input_data in pd.read_csv(
        input_tsv, sep="\t", chunksize=chunksize, low_memory=False
    ):

        test_data = ntchange.ntchange(input_data)

        for non_feature_i in non_feature:
            if non_feature_i in test_data:
                test_data.drop(non_feature_i, axis=1, inplace=True)

        # Convert infinite values to np.nan so that xgboost can handle them
        # (this needs numpy imported as np in the module).
        test_data.replace([np.inf, -np.inf], np.nan, inplace=True)
        dtest = xgb.DMatrix(test_data)
        # ntree_limit is replaced by iteration_range=(0, iterations).
        scores = xgb_model.predict(dtest, iteration_range=(0, iterations))
        predicted = input_data.assign(SCORE=scores)
        predicted.to_csv(
            output_tsv,
            sep="\t",
            index=False,
            mode=writeMode,
            header=writeHeader,
            na_rep="nan",
        )

I'm not sure whether converting infinite values to NaN is the best approach for your model. Maybe turning -INF and INF values into 0 or into a very large number is better. I would also suggest pinning specific versions of the required packages when building somaticseq with conda/pip! With xgboost 1.7.3, ntree_limit still existed.
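The three options mentioned above (NaN, 0, or a large finite value) can be compared side by side. This is a minimal sketch in plain Python, assuming a hypothetical helper `sanitize` and an arbitrary cap of `1e30`; neither is part of SomaticSeq, and which strategy suits the trained model is still an open question.

```python
import math


def sanitize(values, strategy="nan", cap=1e30):
    """Replace +/-inf in a list of floats (hypothetical helper, not SomaticSeq code).

    strategy: "nan"  -> inf becomes NaN
              "zero" -> inf becomes 0.0
              "clip" -> inf becomes +/-cap, keeping the sign
    """
    if strategy not in ("nan", "zero", "clip"):
        raise ValueError(f"unknown strategy: {strategy}")
    cleaned = []
    for v in values:
        if math.isinf(v):
            if strategy == "nan":
                v = math.nan
            elif strategy == "zero":
                v = 0.0
            else:
                v = math.copysign(cap, v)
        cleaned.append(v)
    return cleaned


# Made-up feature row containing both infinities.
row = [0.5, math.inf, -math.inf, 12.0]
as_nan = sanitize(row, "nan")
as_zero = sanitize(row, "zero")
as_clip = sanitize(row, "clip")
```

Clipping keeps the sign of the original value, while the 0 option discards it; that difference may matter if the model learned from the direction of a feature.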

litaifang commented 5 months ago

Thanks. iteration_range was introduced in v1.4. Think I'll make it xgboost>=1.4.
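If older xgboost installations ever had to be supported alongside newer ones, the keyword could be chosen from the installed version. The sketch below is only an illustration under that assumption; `parse_version` and `predict_kwargs` are hypothetical helpers, not part of SomaticSeq or xgboost.

```python
def parse_version(version):
    """Turn a string such as '2.0.3' into (2, 0, 3); non-numeric suffixes are dropped."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def predict_kwargs(xgb_version, iterations):
    """Pick the keyword for the predict call (hypothetical helper).

    Uses iteration_range from v1.4 on, as stated in this thread,
    and falls back to ntree_limit for anything older.
    """
    if parse_version(xgb_version) >= (1, 4):
        return {"iteration_range": (0, iterations)}
    return {"ntree_limit": iterations}


new_style = predict_kwargs("2.0.3", 100)
old_style = predict_kwargs("1.3.3", 100)
```

The result would be passed along as `xgb_model.predict(dtest, **predict_kwargs(xgb.__version__, iterations))`. Requiring xgboost>=1.4 outright makes such a branch unnecessary.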

litaifang commented 5 months ago

Do you know where the data got the unexpected Inf or NaN?

GACGAMA commented 5 months ago

I have no idea exactly where it is because I'm processing on a cluster, but I could try running it and saving the training dataset to explore if you want. I'm just not sure how. What I know is that the samples with INF values are FFPE and, even though we have good depth and coverage, we found a lot of paraffin artifacts (which is why we wanted to try somaticseq). I'm still working on the data to compare the simple consensus and the AI model.
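One way to locate the offending values is to scan the intermediate TSV for cells that parse as `inf` or `nan`. The following is a rough sketch using only the standard library; `find_nonfinite` is a hypothetical helper, and the column names in the demo are made up rather than taken from a real SomaticSeq output.

```python
import csv
import io
import math


def find_nonfinite(tsv_handle):
    """Yield (row_number, column, raw_text) for every inf/nan cell in a TSV.

    Row numbers are 1-based and do not count the header line.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for column, text in row.items():
            try:
                value = float(text)
            except (TypeError, ValueError):
                continue  # non-numeric cell, e.g. a chromosome name or an allele
            if not math.isfinite(value):
                yield row_number, column, text


# Small in-memory example; a real run would pass open("some_file.tsv") instead.
demo = io.StringIO(
    "CHROM\tPOS\tfeature_a\tfeature_b\n"
    "chr1\t100\t0.5\tinf\n"
    "chr1\t200\tnan\t3\n"
)
hits = list(find_nonfinite(demo))
```

Counting the hits per column would show which feature is responsible; cells written as `nan` are reported too, so those may need to be filtered out when only infinities are of interest.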

litaifang commented 5 months ago

I looked around online, and it seems people have gotten that error when there are very large numbers (e.g., 1e300) in the data: https://stackoverflow.com/questions/67986268/xgboost-check-failed-valid-input-data-contains-inf-or-nan
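A value like 1e300 is a valid 64-bit float but is too large for a 32-bit float, so it can end up as `inf` once the data is stored in single precision. The sketch below checks for that case with the standard library, assuming the feature matrix is held as 32-bit floats; `overflows_float32` is a hypothetical helper name.

```python
import math
import struct


def overflows_float32(value):
    """Return True if a finite Python float cannot be represented as a float32."""
    if not math.isfinite(value):
        return False  # already inf or nan, which is a separate problem
    try:
        # "<f" packs a standard-size IEEE 754 single and rejects values that are too large.
        struct.pack("<f", value)
    except OverflowError:
        return True
    return False


too_big = overflows_float32(1e300)
fits = overflows_float32(1e30)
```

Running such a check over the numeric columns would separate values that are literally infinite from values that only become infinite after conversion.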

GACGAMA commented 4 months ago

I think I might have found why there are infinite values. Somehow, some SNPs and INDELs are being called while having 0 alternate alleles (DP4), or 0% VAF. I have no idea why, though. I will probably filter out all variants with 0% VAF, but I am not sure whether I should do that before or after consensus calling.
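For illustration, a VAF can be derived from DP4 (reference forward, reference reverse, alternate forward, alternate reverse read counts), and calls without alternate support can be dropped. This is a minimal sketch with made-up records; how SomaticSeq itself derives its features from these counts is not shown in this thread, and both helper names are hypothetical.

```python
def vaf_from_dp4(dp4):
    """Return alt / (ref + alt) for DP4 = (ref_fwd, ref_rev, alt_fwd, alt_rev).

    Returns None when there are no reads at all, which avoids a 0/0 division.
    """
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    alt = alt_fwd + alt_rev
    depth = ref_fwd + ref_rev + alt
    if depth == 0:
        return None
    return alt / depth


def has_alt_support(dp4):
    """True if at least one read supports the alternate allele."""
    return dp4[2] + dp4[3] > 0


# Made-up calls: (chromosome, position, DP4)
calls = [
    ("chr1", 100, (30, 28, 4, 5)),
    ("chr1", 200, (40, 35, 0, 0)),  # called, but no alternate reads
    ("chr2", 300, (0, 0, 0, 0)),    # no reads at all
]
kept = [call for call in calls if has_alt_support(call[2])]
```

Any feature that divides by the alternate read count would have a zero denominator for the second and third records, which is one plausible route to `inf` or `nan`.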

litaifang commented 4 months ago

Hmm, maybe because DP4 and VAF are computed with quality filters (e.g., minimum mapping quality or base call quality) that may differ from what the mutation caller used to make that call.
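The effect described above can be shown with a toy example: the same reads give a different alternate count once minimum mapping and base quality thresholds are applied. The read records and thresholds below are invented for illustration and do not reflect the settings of SomaticSeq or of any particular caller.

```python
def count_alt_reads(reads, min_mapq=0, min_baseq=0):
    """Count reads supporting the alternate allele that pass both quality cutoffs."""
    return sum(
        1
        for read in reads
        if read["is_alt"]
        and read["mapq"] >= min_mapq
        and read["baseq"] >= min_baseq
    )


# Hypothetical pileup at one position.
reads = [
    {"is_alt": True, "mapq": 0, "baseq": 30},    # alt read with poor mapping quality
    {"is_alt": True, "mapq": 60, "baseq": 5},    # alt read with poor base quality
    {"is_alt": False, "mapq": 60, "baseq": 35},  # reference read
]

unfiltered = count_alt_reads(reads)
filtered = count_alt_reads(reads, min_mapq=1, min_baseq=10)
```

A caller counting all reads would see two alternate reads here, while a stricter recount would see none and report 0% VAF for the same call.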