GACGAMA opened 5 months ago
Using older versions of xgboost (1.7.1 and 1.6), I get different errors, which might be more on the bug side.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 84, in runPaired_by_region
run_somaticseq.runPaired(
File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 169, in runPaired
modelPredictor(ensembleSnv, classifiedSnvTsv, algo, classifier_snv, iterations=iterations, features_to_exclude=features_excluded)
File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 87, in modelPredictor
somatic_xgboost.predictor(classifier, input_file, output_file, non_features, iterations)
File "/home/ggama1/programs/somaticseq/somaticseq/somatic_xgboost.py", line 172, in predictor
dtest = xgb.DMatrix(test_data)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 532, in inner_f
return f(**kwargs)
^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 643, in __init__
handle, feature_names, feature_types = dispatch_data_backend(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 896, in dispatch_data_backend
return _from_pandas_df(data, enable_categorical, missing, threads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 348, in _from_pandas_df
return _from_numpy_array(data, missing, nthread, feature_names, feature_types)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/data.py", line 184, in _from_numpy_array
_check_call(
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/core.py", line 203, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`
Stack trace:
[bt] (0) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x154c79) [0x1554604bfc79]
[bt] (1) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x179e1d) [0x1554604e4e1d]
[bt] (2) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1aaeea) [0x155460515eea]
[bt] (3) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x16a2d5) [0x1554604d52d5]
[bt] (4) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromDense+0x453) [0x15546041fbb3]
[bt] (5) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x15554f7e7052]
[bt] (6) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x15554f7e5925]
[bt] (7) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x15554f7e606e]
[bt] (8) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e4) [0x15554f7f72e4]
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ggama1/.conda/envs/somaticseq/bin/somaticseq_parallel.py", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 308, in <module>
subdirs = pool.map(runPaired_by_region_i, bed_splitted)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`
Stack trace:
[bt] (0) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x154c79) [0x1554604bfc79]
[bt] (1) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x179e1d) [0x1554604e4e1d]
[bt] (2) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x1aaeea) [0x155460515eea]
[bt] (3) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(+0x16a2d5) [0x1554604d52d5]
[bt] (4) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromDense+0x453) [0x15546041fbb3]
[bt] (5) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x15554f7e7052]
[bt] (6) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x15554f7e5925]
[bt] (7) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x15554f7e606e]
[bt] (8) /home/ggama1/.conda/envs/somaticseq/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e4) [0x15554f7f72e4]
Which seems to be caused by:
xgboost.core.XGBoostError: [18:14:40] ../src/data/data.cc:1163: Check failed: valid: Input data contains `inf` or `nan`
This only happens when applying the model to high-read-depth WGS samples; WES samples work fine.
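For anyone debugging this, here is a quick way to see which feature columns carry the non-finite values before the DMatrix is built (a sketch; report_nonfinite is a made-up helper, not part of somaticseq):

import numpy as np
import pandas as pd

def report_nonfinite(df: pd.DataFrame) -> pd.DataFrame:
    """Print columns containing inf/NaN and return the offending rows."""
    numeric = df.select_dtypes(include=[np.number])
    bad = ~np.isfinite(numeric)
    print("Columns with inf/NaN:", list(numeric.columns[bad.any(axis=0)]))
    # Note: with the default missing=NaN, it is usually the inf values
    # that actually trip this DMatrix check.
    return df.loc[bad.any(axis=1)]

# e.g., call report_nonfinite(test_data) right before xgb.DMatrix(test_data)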
Yeah, thanks for the report. ntree_limit has been deprecated in xgboost. Let me figure out the best way forward.
I've managed to make it work with version 2.0.3 of xgboost. Starting at line 163 of somatic_xgboost.py:
for input_data in pd.read_csv(
    input_tsv, sep="\t", chunksize=chunksize, low_memory=False
):
    test_data = ntchange.ntchange(input_data)
    for non_feature_i in non_feature:
        if non_feature_i in test_data:
            test_data.drop(non_feature_i, axis=1, inplace=True)

    # Transform infinite values into np.nan so xgboost can handle them
    test_data.replace([np.inf, -np.inf], np.nan, inplace=True)

    dtest = xgb.DMatrix(test_data)
    # ntree_limit is deprecated; use iteration_range=(0, iterations) instead
    scores = xgb_model.predict(dtest, iteration_range=(0, iterations))
    predicted = input_data.assign(SCORE=scores)
    predicted.to_csv(
        output_tsv,
        sep="\t",
        index=False,
        mode=writeMode,
        header=writeHeader,
        na_rep="nan",
    )
I'm not sure whether transforming infinite values to NaN is the best approach for your model; maybe turning -inf and inf into 0 or into a very large number is better. I would also suggest pinning specific versions of the required packages when building somaticseq with conda/pip! With xgboost 1.7.3, ntree_limit still existed.
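If capping is preferred over masking, a minimal sketch (the sentinel values are arbitrary placeholders; a toy frame stands in for test_data from the excerpt above):

import numpy as np
import pandas as pd

# Toy frame standing in for test_data from the excerpt above.
test_data = pd.DataFrame({"feature": [0.5, np.inf, -np.inf]})

# Cap infinities at large finite sentinels so the model sees an
# extreme-but-valid value rather than a missing one.
test_data.replace([np.inf, -np.inf], [1e30, -1e30], inplace=True)
print(test_data)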
Thanks. iteration_range was introduced in v1.4. I think I'll make it xgboost>=1.4.
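A minimal sketch of a version guard for that (reusing xgb_model, dtest, and iterations from the excerpt above; packaging is an extra dependency assumed here):

import xgboost as xgb
from packaging import version

# iteration_range was added in xgboost 1.4; ntree_limit is deprecated
# from 1.4 onward and removed in 2.x, so branch on the installed version.
if version.parse(xgb.__version__) >= version.parse("1.4"):
    scores = xgb_model.predict(dtest, iteration_range=(0, iterations))
else:
    scores = xgb_model.predict(dtest, ntree_limit=iterations)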
Do you know where the data got the unexpected inf or NaN?
I have no idea exactly where it is because I'm processing on a cluster, but I could try running it and saving the training dataset to explore, if you want; I'm just not sure how. What I know is that the samples with inf values are FFPE and, even though we have good depth and coverage, we found a lot of paraffin artifacts (which is why we wanted to try somaticseq). I'm still working on the data to compare the simple consensus with the AI model.
Looking around on the internet, it seems people have gotten that error when there are very large numbers (e.g., 1e300) in the data: https://stackoverflow.com/questions/67986268/xgboost-check-failed-valid-input-data-contains-inf-or-nan
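That would be consistent with this check: a value like 1e300 is finite in float64 but overflows to inf when cast to float32, which is presumably what DMatrix stores internally. Quick demonstration:

import numpy as np

x = np.array([1e300])                  # finite as float64
print(np.isinf(x))                     # [False]
print(np.isinf(x.astype(np.float32)))  # [ True]: overflows float32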
I think I might have found why there are infinite values. Somehow, some SNPs and INDELs are being called while having 0 alternate alleles (DP4), or 0% VAF. I have no idea why, though. I will probably filter out all variants with 0% VAF before doing the consensus, but I'm not sure whether I should do that before or after consensus calling.
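One possible form of that pre-filter, sketched on a feature TSV (the file name and the VAF column name are hypothetical; the real SomaticSeq column names may differ):

import pandas as pd

# Drop calls with zero VAF before scoring, since ratio features
# computed from zero alt-read counts can come out as inf.
df = pd.read_csv("Ensemble.sSNV.tsv", sep="\t")  # illustrative file name
df[df["T_VAF"] > 0].to_csv(  # column name assumed
    "Ensemble.sSNV.filtered.tsv", sep="\t", index=False
)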
Hmm, maybe because DP4 and VAF have quality filters (e.g., minimum mapping quality or base call quality) that a mutation caller used to make that call.
I'm trying to run somaticseq_parallel on some samples' VCFs to call the AI consensus. The SomaticSeq version is v3.7.3 and the xgboost version is 2.0.2. I ran all the mutation callers and then, with the VCF files, ran the following command:
somaticseq_parallel.py \
    --classifier-snv /scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/SNV_model.classifier \
    --classifier-indel /scratch4/nsobrei2/ggama1/training/somaticseq/ai_model_titration_ffpe_wgs_synth/INDEL_model.classifier \
    --output-directory /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/consensus_AI/kids_first/BH12847_1_TUMOR \
    --genome-reference /scratch4/nsobrei2/references/ncbi_grch38_cipher/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -dbsnp /scratch4/nsobrei2/references/dbsnp/138_cipher/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    --threads 38 \
    paired \
    --tumor-bam-file /scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_TUMOR.bam \
    --normal-bam-file /scratch4/nsobrei2/ggama1/germline-tumor/bams/BH12847_1_GERMLINE.bam \
    --mutect2-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuTect2.vcf.gz \
    --vardict-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarDict.vcf.gz \
    --somaticsniper-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.SomaticSniper.vcf.gz \
    --muse-vcf /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.MuSE.vcf.gz \
    --strelka-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.snv.vcf.gz \
    --strelka-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.Strelka.indel.vcf.gz \
    --varscan-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.snv.vcf.gz \
    --varscan-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.VarScan2.indel.vcf.gz \
    --lofreq-snv /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.snv.vcf.gz \
    --lofreq-indel /scratch4/nsobrei2/ggama1/germline-tumor/cavatica/somaticseq/vcf_per_sample/extracted_vcf/kids_first/unsorted/BH12847_1_TUMOR.LoFreq.indel.vcf.gz
This is the output with the error:
The output from creating the AI model used in the above command was: