autonlab / auton-survival

Auton Survival - an open source package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Events
http://autonlab.github.io/auton-survival
MIT License

ValueError: Input estimate contains NaN #94

Closed zapaishchykova closed 1 year ago

zapaishchykova commented 1 year ago

Hello! Thanks for such a unique package. I am trying to use DeepSurvivalMachines (note: on the same dataset, DeepCoxMixtures works without any issues). Here is the error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [24], line 12
      8 et_val = np.array([(e_val[i], t_val[i]) for i in range(len(e_val))],
      9                  dtype = [('e', bool), ('t', float)])
     11 for i, _ in enumerate(times):
---> 12     cis.append(concordance_index_ipcw(et_train, et_test, out_risk[:, i], times[i])[0])
     13 #brs.append(brier_score(et_train, et_test, out_survival, times)[1])
     14 roc_auc = []

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sksurv/metrics.py:324, in concordance_index_ipcw(survival_train, survival_test, estimate, tau, tied_tol)
    321     mask = test_time < tau
    322     survival_test = survival_test[mask]
--> 324 estimate = _check_estimate_1d(estimate, test_time)
    326 cens = CensoringDistributionEstimator()
    327 cens.fit(survival_train)

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sksurv/metrics.py:36, in _check_estimate_1d(estimate, test_time)
     35 def _check_estimate_1d(estimate, test_time):
---> 36     estimate = check_array(estimate, ensure_2d=False, input_name="estimate")
     37     if estimate.ndim != 1:
     38         raise ValueError(
     39             'Expected 1D array, got {:d}D array instead:\narray={}.\n'.format(
     40                 estimate.ndim, estimate))

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sklearn/utils/validation.py:899, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    893         raise ValueError(
    894             "Found array with dim %d. %s expected <= 2."
    895             % (array.ndim, estimator_name)
    896         )
    898     if force_all_finite:
--> 899         _assert_all_finite(
    900             array,
    901             input_name=input_name,
    902             estimator_name=estimator_name,
    903             allow_nan=force_all_finite == "allow-nan",
    904         )
    906 if ensure_min_samples > 0:
    907     n_samples = _num_samples(array)

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sklearn/utils/validation.py:146, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    124         if (
    125             not allow_nan
    126             and estimator_name
   (...)
    130             # Improve the error message on how to handle missing values in
    131             # scikit-learn.
    132             msg_err += (
    133                 f"\n{estimator_name} does not accept missing values"
    134                 " encoded as NaN natively. For supervised learning, you might want"
   (...)
    144                 "#estimators-that-handle-nan-values"
    145             )
--> 146         raise ValueError(msg_err)
    148 # for object dtype data, we only check for NaNs (GH-13254)
    149 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input estimate contains NaN.

Some more details:

  1. I convert the feature DataFrames to float64 and the outcome columns to int64:

    features = df_features.copy().astype('float64')

    outcomes = pd.DataFrame()
    outcomes['event'] = pd.DataFrame(data_y)['Status'].astype('int64')
    outcomes['time'] = pd.DataFrame(data_y)['Survival_in_days'].astype('int64')

    features_val = df_features_val.copy().astype('float64')
    outcomes_val = pd.DataFrame()
    outcomes_val['event'] = pd.DataFrame(data_y_val)['Status'].astype('int64')
    outcomes_val['time'] = pd.DataFrame(data_y_val)['Survival_in_days'].astype('int64')

2. Then training the model:

from auton_survival.models.dsm import DeepSurvivalMachines
from sklearn.model_selection import ParameterGrid

param_grid = {'k' : [3, 4, 6],
              'distribution' : ['LogNormal', 'Weibull'],
              'learning_rate' : [1e-4, 1e-3],
              'layers' : [[], [100], [100, 100]]}
params = ParameterGrid(param_grid)

models = []
for param in params:
    model = DeepSurvivalMachines(k = param['k'],
                                 distribution = param['distribution'],
                                 layers = param['layers'])
    # The fit method is called to train the model
    model.fit(x, outcomes.time, outcomes.event, iters = 100,
              learning_rate = param['learning_rate'])
    models.append([[model.compute_nll(x_val, outcomes_val.time, outcomes_val.event), model]])

best_model = min(models)
model = best_model[0][1]

3. And then it fails on the evaluation step:

cis = []
brs = []

et_train = np.array([(e_train[i], t_train[i]) for i in range(len(e_train))],
                    dtype = [('e', bool), ('t', float)])
et_test = np.array([(e_test[i], t_test[i]) for i in range(len(e_test))],
                   dtype = [('e', bool), ('t', float)])
et_val = np.array([(e_val[i], t_val[i]) for i in range(len(e_val))],
                  dtype = [('e', bool), ('t', float)])

times = np.quantile(outcomes.time[outcomes.event==1], [0.25, 0.5, 0.6]).tolist()

for i, _ in enumerate(times):
    cis.append(concordance_index_ipcw(et_train, et_test, out_risk[:, i], times[i])[0])


When I check out_risk[:, i], which was created by model.predict_risk(x_val, times), it is all NaNs:

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]


Does that mean that the model did not converge? Any tips are appreciated!
chiragnagpal commented 1 year ago

Hi @zapaishchykova, DSM expects all passed times t to be strictly greater than 0. Can you check that in your data?

zapaishchykova commented 1 year ago

Hi! Here it is:

times = np.quantile(outcomes.time[outcomes.event==1], [0.25, 0.5, 0.6]).tolist()
times
[13.0, 27.0, 35.0]
chiragnagpal commented 1 year ago

Can you check if there are any zeros in outcomes.time?
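
For reference, a quick check could look like this (a minimal sketch, assuming outcomes and outcomes_val are the DataFrames from your snippet above):

# Count non-positive times in the training and validation outcomes;
# DSM expects every time to be strictly greater than zero.
print("train times <= 0:", (outcomes['time'] <= 0).sum())
print("val times <= 0:", (outcomes_val['time'] <= 0).sum())
print("min times:", outcomes['time'].min(), outcomes_val['time'].min())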

zapaishchykova commented 1 year ago

There are indeed some zeros in outcomes.time! Should I replace them with some small value instead?

chiragnagpal commented 1 year ago

Aha! Yes, either add a small non-zero offset like 1e-4, or just add a constant value to every time so that the scale is strictly positive.
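
A minimal sketch of that fix, assuming the outcomes / outcomes_val DataFrames from above (1e-4 is just an example offset):

eps = 1e-4
# Shift every time by a small positive constant so all times are strictly > 0;
# apply the same shift to train and validation to keep the scales consistent.
outcomes['time'] = outcomes['time'].astype('float64') + eps
outcomes_val['time'] = outcomes_val['time'].astype('float64') + eps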

zapaishchykova commented 1 year ago

Some progress, now I get a different error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [48], line 18
     16 roc_auc = []
     17 for i, _ in enumerate(times):
---> 18     roc_auc.append(cumulative_dynamic_auc(et_train, et_test, out_risk[:, i], times[i])[0])
     19 for horizon in enumerate(horizons):
     20     print(f"For {horizon[1]} quantile,")

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sksurv/metrics.py:468, in cumulative_dynamic_auc(survival_train, survival_test, estimate, times, tied_tol)
    466 cens = CensoringDistributionEstimator()
    467 cens.fit(survival_train)
--> 468 ipcw = cens.predict_ipcw(survival_test)
    470 # expand arrays to (n_samples, n_times) shape
    471 test_time = numpy.broadcast_to(test_time[:, numpy.newaxis], (n_samples, n_times))

File ~/miniconda3/envs/pycox310/lib/python3.10/site-packages/sksurv/nonparametric.py:448, in CensoringDistributionEstimator.predict_ipcw(self, y)
    445 Ghat = self.predict_proba(time[event])
    447 if (Ghat == 0.0).any():
--> 448     raise ValueError("censoring survival function is zero at one or more time points")
    450 weights = numpy.zeros(time.shape[0])
    451 weights[event] = 1.0 / Ghat

ValueError: censoring survival function is zero at one or more time points
chiragnagpal commented 1 year ago

Are you trying to perform cross-validation? Is this a relatively small dataset?

zapaishchykova commented 1 year ago

It is a small dataset! Interestingly, with this dataset I was also unable to compute the ROC using scikit-survival directly, with a similar-looking error.

chiragnagpal commented 1 year ago

Yeah, we use scikit-survival for the underlying metrics computation. Try shuffling your CV folds? It might help. Essentially, computing performance for the models requires the same range of times to be present in the training and testing folds; with your dataset's smaller size, a fold can contain times beyond what was seen in the training set.
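
For example, a rough sketch of shuffled folds stratified on the event indicator with scikit-learn (variable names follow the earlier snippets; adapt as needed):

from sklearn.model_selection import StratifiedKFold

# Shuffle and stratify on the event indicator so every fold has a similar
# censoring pattern and a comparable range of observed times.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(features, outcomes['event']):
    x_train, x_test = features.iloc[train_idx], features.iloc[test_idx]
    y_train, y_test = outcomes.iloc[train_idx], outcomes.iloc[test_idx]
    # fit DeepSurvivalMachines on (x_train, y_train) and evaluate on (x_test, y_test)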

zapaishchykova commented 1 year ago

Aha, maybe stratified creation of the folds will make more sense for such a small set then. Closing this for now, thanks a lot!