dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.11k stars 8.7k forks

Error in cox-regression while evaluating #6885

Open Stochastic13 opened 3 years ago

Stochastic13 commented 3 years ago

I am using XGBoost version 1.3.3 on Windows with Python 3.6.8. When training with the objective set to survival:cox, I repeatedly get this error:

Traceback (most recent call last):
  File "xgboost_survival_cv.py", line 94, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
    early_stopping_rounds=early_stopping_rounds)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

There are no NaNs in my data. I initially assumed this might be due to some overflow, since the error often appeared once the test loss (cox-nloglik) exceeded 15 or so in the last successful boosting iteration; consistent with that, the error disappeared when I used fewer boosting rounds, a smaller learning rate, smaller trees (no overfitting and hence no blowup of the test loss?), or switched off evaluation (an empty evals list). But later I got the same error when the test loss was only 6 in the last successful boosting iteration. Furthermore, even with evaluation removed (I need early_stopping_rounds, so this is not a long-term option), I still get NaN (or inf) in the prediction output, though no error is raised. The data is highly censored (90% right-censored), in case that matters.

The run parameters were:

{'colsample_bytree': 0.8, 'eta': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 
'num_parallel_tree': 20, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method':'gpu_hist', 
'verbosity':1,  'seed':0, 'objective':'survival:cox', 'eval_metric':'cox-nloglik'}

The same error occurs for many different parameter sets. Another example: {'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'gradient_based', 'subsample': 0.2}
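
For reference, this is how I confirm the non-finite predictions when evaluation is switched off (a sketch; `m` and `data_v` are the booster and validation DMatrix from the script posted below):

    import numpy as np

    # after training with evals=[] (no evaluation, hence no error raised)
    pred = m.predict(data_v)
    print(np.isfinite(pred).all())  # False here: some predictions come out as inf/NaN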

hcho3 commented 3 years ago

Are you able to post your data and training script? That will help us further diagnose this problem.

Stochastic13 commented 3 years ago

@hcho3 I can post the main part of the training script and the output. There is a large section of preprocessing and CV setup that I am skipping. Also, the parameters are chosen just to trigger the error quickly.

    data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
    data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
    print('Imbalance: ', imbalance)
    print('Length/NA ytrain:', len(y_coded_train), np.sum(np.isnan(y_coded_train)))
    print('Length/NA xtrain:', xsub_train.shape, np.sum(np.isnan(xsub_train.to_numpy().flatten())))
    print('Length/NA yest:', len(y_coded_test), np.sum(np.isnan(y_coded_test)))
    print('Length/NA xtest:', xsub_test.shape, np.sum(np.isnan(xsub_test.to_numpy().flatten())))
    print(p)
    print(np.sum(ysub_train), np.sum(ysub_test))
    m = xg.train(params=p, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
                 evals=[(data_m, 'train'), (data_v, 'eval')])

Output (in other runs, as I said above, the score does not have to get this high for the error to appear):

Imbalance:  31.582978723404256
Length/NA ytrain: 6125 0
Length/NA xtrain: (6125, 52) 0
Length/NA yest: 1532 0
Length/NA xtest: (1532, 52) 0
{'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method': 'gpu_hist', 'objective': 'survival:cox', 'eval_metric': 'cox-nloglik', 'seed': 0, 'verbosity': 1}
188 47
[0]     train-cox-nloglik:8.88308       eval-cox-nloglik:6.99363
[1]     train-cox-nloglik:9.26756       eval-cox-nloglik:7.79090
[2]     train-cox-nloglik:9.64561       eval-cox-nloglik:8.33538
[3]     train-cox-nloglik:10.17499      eval-cox-nloglik:9.04028
[4]     train-cox-nloglik:10.41129      eval-cox-nloglik:9.17929
[5]     train-cox-nloglik:10.87415      eval-cox-nloglik:10.01846
[6]     train-cox-nloglik:11.59993      eval-cox-nloglik:10.37045
[7]     train-cox-nloglik:12.22635      eval-cox-nloglik:10.77931
[8]     train-cox-nloglik:12.22635      eval-cox-nloglik:10.77931
[9]     train-cox-nloglik:12.40515      eval-cox-nloglik:11.21803
[10]    train-cox-nloglik:12.40514      eval-cox-nloglik:11.21803
[11]    train-cox-nloglik:12.40514      eval-cox-nloglik:11.21803
Traceback (most recent call last):
  File "xgboost_survival_cv.py", line 101, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
    early_stopping_rounds=early_stopping_rounds)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

Should I print anything else of importance?

hcho3 commented 3 years ago

@Stochastic13 Can you post the data after the pre-processing step? If we cannot run the program ourselves, it's hard for us developers to find the cause of the error.

Stochastic13 commented 3 years ago

@hcho3 I understand. Unfortunately, the data is confidential. Here's a reproducible example I recreated with random data:

import numpy as np
import pandas as pd
import xgboost as xg

print(xg.__version__)
print(pd.__version__)
print(np.__version__)

param_cv = dict()
param_cv['eta'] = 0.3
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 100
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.8
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.8
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1
np.random.seed(0)

imbalance = 31.58

xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
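# survival:cox convention: negative label values mark right-censored observations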
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]

xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
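# upweight the rare event rows (ysub == 1) by the class-imbalance factor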
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
             evals=[(data_m, 'train'), (data_v, 'eval')])

And the output:

1.3.3
1.1.5
1.19.5
[0]     train-cox-nloglik:7.98620       eval-cox-nloglik:6.65554
[1]     train-cox-nloglik:8.27961       eval-cox-nloglik:7.17497
[2]     train-cox-nloglik:8.66985       eval-cox-nloglik:7.64955
[3]     train-cox-nloglik:9.10624       eval-cox-nloglik:8.32161
[4]     train-cox-nloglik:9.56293       eval-cox-nloglik:8.84506
[5]     train-cox-nloglik:10.12860      eval-cox-nloglik:9.36206
[6]     train-cox-nloglik:10.66752      eval-cox-nloglik:10.30291
[7]     train-cox-nloglik:11.37288      eval-cox-nloglik:10.78025
[8]     train-cox-nloglik:12.12385      eval-cox-nloglik:12.21345
[9]     train-cox-nloglik:12.66841      eval-cox-nloglik:13.19342
[10]    train-cox-nloglik:12.66842      eval-cox-nloglik:13.19342
[11]    train-cox-nloglik:12.66842      eval-cox-nloglik:13.19342
Traceback (most recent call last):
  File "xgboost_error.py", line 39, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
    early_stopping_rounds=early_stopping_rounds)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

Stochastic13 commented 3 years ago

@hcho3 I can also try similar random datasets with different extents of censoring, if that would help narrow down the problem. I was mainly worried that I had made some mistake in setting up the training.

trivialfis commented 3 years ago

Weird that the nloglik increases.

Stochastic13 commented 3 years ago

@trivialfis In case it helps: the increase doesn't have to be large for the error to occur, either. The following script hits the same error after 85 iterations, even though neither the train nor the eval score changes at all in the first 5 decimal places.

import numpy as np
import pandas as pd
import xgboost as xg

print(xg.__version__)
print(pd.__version__)
print(np.__version__)

param_cv = dict()
param_cv['eta'] = 0.1
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 200
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.5
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.5
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1
np.random.seed(0)

imbalance = 31.58

xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]

xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=600,
             evals=[(data_m, 'train'), (data_v, 'eval')])
quit(1)

mayer79 commented 3 years ago

I played a bit with the code. Without passing weights, I could not reproduce the problem. However, even with quite low weights (imbalance 2 or 3), the problem remained.
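
Concretely, the no-weights variant is just the repro above with the `weight` argument dropped (a sketch; everything else unchanged):

    # same repro as above, minus the per-row weights
    data_m = xg.DMatrix(xsub_train, label=y_coded_train)
    data_v = xg.DMatrix(xsub_test, label=y_coded_test)
    m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000,
                 early_stopping_rounds=25,
                 evals=[(data_m, 'train'), (data_v, 'eval')])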

XiangBu commented 2 years ago

Same problem here. Have you solved this? Thanks!

Stochastic13 commented 2 years ago

@SandyBy Not directly, unfortunately. I had to change the data processing to get different results. I didn't have much success with other implementations either. See if scikit-survival helps you, since it has gradient boosting with a Cox-PH loss. It is much slower than XGBoost, though.

XiangBu commented 2 years ago

If you don't use cv, it works fine. Also, I have tried scikit-survival already. Thanks a lot for the kind reply!

trivialfis commented 2 years ago

Could you please check out the new survival training module in XGBoost: https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html?
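
A minimal sketch of the AFT interface, adapted from that tutorial (toy data mirroring the repro above; the parameter values are only illustrative). AFT encodes right-censoring via interval bounds rather than signed labels:

    import numpy as np
    import xgboost as xg

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, (1000, 52))
    y = rng.integers(1, 1500, 1000).astype(float)
    censored = rng.random(1000) < 0.9  # ~90% right-censored

    # right-censored rows get an upper bound of +inf
    dtrain = xg.DMatrix(X)
    dtrain.set_float_info('label_lower_bound', y)
    dtrain.set_float_info('label_upper_bound', np.where(censored, np.inf, y))

    params = {'objective': 'survival:aft', 'eval_metric': 'aft-nloglik',
              'aft_loss_distribution': 'normal', 'aft_loss_distribution_scale': 1.20,
              'tree_method': 'hist', 'learning_rate': 0.05, 'max_depth': 3}
    bst = xg.train(params, dtrain, num_boost_round=100, evals=[(dtrain, 'train')])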

Ruihaoh commented 2 years ago

Hi everyone, I found an easy way to work around this bug. It happens because the nloglik becomes infinity or NaN, so the resulting score string cannot be converted to a float.

In the xgboost\callback.py file, change the line `cvmap[(metric_idx, k)].append(float(v))` so that the conversion is wrapped in a try/except.

Thanks

skyee1 commented 1 year ago

Hello. I tried your patch, but I encountered a new bug:

File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\core.py", line 617, in inner_f
    return func(**kwargs)
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\training.py", line 196, in train
    if cb_container.after_iteration(bst, i, dtrain, evals):
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in after_iteration
    metric_score = [(n, float(s)) for n, s in metric_score_str]
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in <listcomp>
    metric_score = [(n, float(s)) for n, s in metric_score_str]
ValueError: could not convert string to float: '-nan(ind)'

The error just moves to the next place where the score string is parsed as a float.

Ediebah commented 1 year ago

Hello, I also tried the approach above but encountered the same bug as @skyee1.

Ruihaoh commented 1 year ago

Please try XGBoost version 1.6. Good luck!
