Stochastic13 opened this issue 3 years ago
Are you able to post your data and training script? That will help us further diagnose this problem.
@hcho3 I can post the main part of the training script and the output. There is a large section of preprocessing and CV setup that I am skipping. Also, the parameters are chosen just to trigger the error quickly.
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
print('Imbalance: ', imbalance)
print('Length/NA ytrain:', len(y_coded_train), np.sum(np.isnan(y_coded_train)))
print('Length/NA xtrain:', xsub_train.shape, np.sum(np.isnan(xsub_train.to_numpy().flatten())))
print('Length/NA ytest:', len(y_coded_test), np.sum(np.isnan(y_coded_test)))
print('Length/NA xtest:', xsub_test.shape, np.sum(np.isnan(xsub_test.to_numpy().flatten())))
print(p)
print(np.sum(ysub_train), np.sum(ysub_test))
m = xg.train(params=p, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
evals=[(data_m, 'train'), (data_v, 'eval')])
Output (in other runs, as I said above, the score does not have to get this high for the error to occur):
Imbalance: 31.582978723404256
Length/NA ytrain: 6125 0
Length/NA xtrain: (6125, 52) 0
Length/NA ytest: 1532 0
Length/NA xtest: (1532, 52) 0
{'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method': 'gpu_hist', 'objective': 'survival:cox', 'eval_metric': 'cox-nloglik', 'seed': 0, 'verbosity': 1}
188 47
[0] train-cox-nloglik:8.88308 eval-cox-nloglik:6.99363
[1] train-cox-nloglik:9.26756 eval-cox-nloglik:7.79090
[2] train-cox-nloglik:9.64561 eval-cox-nloglik:8.33538
[3] train-cox-nloglik:10.17499 eval-cox-nloglik:9.04028
[4] train-cox-nloglik:10.41129 eval-cox-nloglik:9.17929
[5] train-cox-nloglik:10.87415 eval-cox-nloglik:10.01846
[6] train-cox-nloglik:11.59993 eval-cox-nloglik:10.37045
[7] train-cox-nloglik:12.22635 eval-cox-nloglik:10.77931
[8] train-cox-nloglik:12.22635 eval-cox-nloglik:10.77931
[9] train-cox-nloglik:12.40515 eval-cox-nloglik:11.21803
[10] train-cox-nloglik:12.40514 eval-cox-nloglik:11.21803
[11] train-cox-nloglik:12.40514 eval-cox-nloglik:11.21803
Traceback (most recent call last):
File "xgboost_survival_cv.py", line 101, in <module>
evals=[(data_m, 'train'), (data_v, 'eval')])
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
early_stopping_rounds=early_stopping_rounds)
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
if callbacks.after_iteration(bst, i, dtrain, evals):
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
self._update_history(score, epoch)
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'
Should I print anything else of importance?
@Stochastic13 Can you post the data after the pre-processing step? If we cannot run the program ourselves, it's hard for us developers to find the cause of the error.
@hcho3 I understand. The data is confidential, unfortunately. Here's a reproducible example I recreated with random data:
import numpy as np
import pandas as pd
import xgboost as xg
print(xg.__version__)
print(pd.__version__)
print(np.__version__)
param_cv = dict()
param_cv['eta'] = 0.3
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 100
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.8
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.8
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1
np.random.seed(0)
imbalance = 31.58
xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]
xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
evals=[(data_m, 'train'), (data_v, 'eval')])
And the Output:
1.3.3
1.1.5
1.19.5
[0] train-cox-nloglik:7.98620 eval-cox-nloglik:6.65554
[1] train-cox-nloglik:8.27961 eval-cox-nloglik:7.17497
[2] train-cox-nloglik:8.66985 eval-cox-nloglik:7.64955
[3] train-cox-nloglik:9.10624 eval-cox-nloglik:8.32161
[4] train-cox-nloglik:9.56293 eval-cox-nloglik:8.84506
[5] train-cox-nloglik:10.12860 eval-cox-nloglik:9.36206
[6] train-cox-nloglik:10.66752 eval-cox-nloglik:10.30291
[7] train-cox-nloglik:11.37288 eval-cox-nloglik:10.78025
[8] train-cox-nloglik:12.12385 eval-cox-nloglik:12.21345
[9] train-cox-nloglik:12.66841 eval-cox-nloglik:13.19342
[10] train-cox-nloglik:12.66842 eval-cox-nloglik:13.19342
[11] train-cox-nloglik:12.66842 eval-cox-nloglik:13.19342
Traceback (most recent call last):
File "xgboost_error.py", line 39, in <module>
evals=[(data_m, 'train'), (data_v, 'eval')])
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
early_stopping_rounds=early_stopping_rounds)
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
if callbacks.after_iteration(bst, i, dtrain, evals):
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
self._update_history(score, epoch)
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'
@hcho3 I can also try similar random datasets with different censoring extents, if that would help narrow down the problem. I was just worried that I had made some mistake in setting up the training.
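As a sketch of what that would look like (the event_frac name is made up for illustration), the censoring extent in the synthetic data is just the event probability passed to np.random.choice, so it is easy to sweep:
import numpy as np
# Sketch: vary the fraction of uncensored (event) rows; for survival:cox a negative label marks right-censoring
for event_frac in [0.03, 0.1, 0.3, 0.5]:
    y_event = np.random.choice([0, 1], 6125, p=[1 - event_frac, event_frac])
    y_label = np.random.randint(1, 1500, 6125)
    y_label[y_event == 0] = -y_label[y_event == 0]
    print('event fraction:', event_frac, 'right-censored fraction:', np.mean(y_event == 0))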
Weird that the nloglik increases.
@trivialfis The increase doesn't have to be large for the error to occur, in case that helps. The following has the same error after 85 iterations, but both the train and the eval scores do not change at all up to the first 5 decimal places.
import numpy as np
import pandas as pd
import xgboost as xg
print(xg.__version__)
print(pd.__version__)
print(np.__version__)
param_cv = dict()
param_cv['eta'] = 0.1
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 200
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.5
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.5
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1
np.random.seed(0)
imbalance = 31.58
xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]
xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=600,
evals=[(data_m, 'train'), (data_v, 'eval')])
quit(1)
I played a bit with the code. Without passing weights, I could not reproduce the problem. However, even with quite low weights (imbalance 2 or 3), the problem remained.
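To make the comparison concrete, a rough sketch of the two variants meant above (reusing xsub_train, ysub_train, param_cv etc. from the script above; the _nw/_lw names are just for illustration):
# 1) No weights: the error did not reproduce this way
data_m_nw = xg.DMatrix(xsub_train, label=y_coded_train)
data_v_nw = xg.DMatrix(xsub_test, label=y_coded_test)
xg.train(params=param_cv, dtrain=data_m_nw, num_boost_round=1000, early_stopping_rounds=25,
         evals=[(data_m_nw, 'train'), (data_v_nw, 'eval')])
# 2) Low weights (imbalance of 2 or 3 instead of 31.58): the error still appears
low_imbalance = 3
data_m_lw = xg.DMatrix(xsub_train, label=y_coded_train,
                       weight=[low_imbalance if i == 1 else 1 for i in ysub_train])
data_v_lw = xg.DMatrix(xsub_test, label=y_coded_test,
                       weight=[low_imbalance if i == 1 else 1 for i in ysub_test])
xg.train(params=param_cv, dtrain=data_m_lw, num_boost_round=1000, early_stopping_rounds=25,
         evals=[(data_m_lw, 'train'), (data_v_lw, 'eval')])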
I am using XGBoost version 1.3.3 on Windows with Python 3.6.8. When training with the objective set to survival:cox, I repeatedly get this error:
Traceback (most recent call last):
File "xgboost_survival_cv.py", line 94, in <module>
evals=[(data_m, 'train'), (data_v, 'eval')])
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
early_stopping_rounds=early_stopping_rounds)
File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
if callbacks.after_iteration(bst, i, dtrain, evals):
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
self._update_history(score, epoch)
File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'
There are no nan values in my data. I initially assumed this might be due to some overflow, since the error often appeared when the test loss (cox-nloglik) exceeded 15 or so in the last successful boosting iteration; the error indeed disappeared when using fewer boosting rounds, a smaller learning rate, smaller trees (no overfitting and hence no blow-up of the test loss?), or switching off evaluation (an empty evals list). But later I got the same error when the test loss was only 6 in the last successful boosting iteration. Further, on removing the evaluation (I need to use early_stopping_rounds, so this is not a long-term option), I still get nan (or inf) in the prediction output, though no error. The data is highly censored (90% right censored), in case that matters. The run parameters were thus:
{'colsample_bytree': 0.8, 'eta': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 20, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method': 'gpu_hist', 'verbosity': 1, 'seed': 0, 'objective': 'survival:cox', 'eval_metric': 'cox-nloglik'}
The same error occurs for many different parameter sets. Another example:
{'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'gradient_based', 'subsample': 0.2}
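For what it's worth, a sketch of the kind of check behind the "nan (or inf) in the prediction output" observation above (reusing param_cv, data_m and data_v from the reproducible example; not the exact code that was run):
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000)  # no evals list: the metric string is never parsed, so no ValueError
pred = m.predict(data_v)
print('nan in predictions:', np.isnan(pred).sum())
print('inf in predictions:', np.isinf(pred).sum())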
Same problem here. Have you solved this? Thanks!
@SandyBy Not directly, unfortunately. I had to change the data processing to get different results. Not much success with other implementations either. See if scikit-survival helps you, since it has gradient-boosted learning with a Cox PH loss. It is much slower than XGBoost, though.
If you don't use cv, it works fine. Also, I have already tried scikit-survival. Thanks a lot for the kind reply!
Could you please check out the new survival training module in XGBoost: https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html ?
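For anyone who lands here, a minimal sketch of that AFT interface based on the linked tutorial (the synthetic data and parameter values here are made up for illustration; right-censored rows get an upper bound of +inf):
import numpy as np
import xgboost as xg

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
lower = rng.uniform(1, 1500, size=500)   # observed or last follow-up times
upper = lower.copy()
censored = rng.random(500) < 0.9         # ~90% right-censored, mirroring the data described above
upper[censored] = np.inf                 # right-censoring: true time is only known to exceed the lower bound

dtrain = xg.DMatrix(X)
dtrain.set_float_info('label_lower_bound', lower)
dtrain.set_float_info('label_upper_bound', upper)

params = {'objective': 'survival:aft',
          'eval_metric': 'aft-nloglik',
          'aft_loss_distribution': 'normal',
          'aft_loss_distribution_scale': 1.0,
          'tree_method': 'hist',
          'max_depth': 3,
          'eta': 0.1}
bst = xg.train(params, dtrain, num_boost_round=100, evals=[(dtrain, 'train')])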
Hi everyone, I found an easy way to work around this bug. It happens because the nloglik goes to infinity or nan, which then cannot be converted into a float.
In the xgboost\callback.py file, change the line
cvmap[(metric_idx, k)].append(float(v))
to
try:
    cvmap[(metric_idx, k)].append(float(v))
except:
    cvmap[(metric_idx, k)].append(numpy.nan)
Thanks
Hello. I tried your method, but I encountered a new bug.
File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\core.py", line 617, in inner_f
return func(**kwargs)
File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\training.py", line 196, in train
if cb_container.after_iteration(bst, i, dtrain, evals):
File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in after_iteration
metric_score = [(n, float(s)) for n, s in metric_score_str]
File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in <listcomp>
metric_score = [(n, float(s)) for n, s in metric_score_str]
ValueError: could not convert string to float: '-nan(ind)'
The error just moves to the next place where the metric string is parsed into a float.
Hello, I also tried the approach above but encountered the same bug as @skyee1.
Please try XGBoost version 1.6. Good luck!