dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

"ValueError: DataFrame.dtypes" upon calling XGBRegressor.fit(), but all columns are numeric #9789

Closed AhmetZamanis closed 8 months ago

AhmetZamanis commented 8 months ago

Issue

I am trying to fit an XGBRegressor model with early stopping & an eval_set in a Jupyter notebook. The training & validation data are Pandas DataFrames, and all columns are of float or int datatype. I still get the following error upon running XGBRegressor.fit():

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameterenable_categorical must be set to True. Invalid columns:market_id: object, store_id: object, store_primary_category: object, order_protocol: object

The four columns named in the error are originally of object datatype, but they are encoded and converted to float with TargetEncoder from the category_encoders package before model training. I suspect some sort of leftover "metadata" is causing XGBoost to still interpret them as object type columns.
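For readers unfamiliar with target encoding, here is a minimal pandas-only sketch of the idea: each category is replaced by the mean of the target over the training rows in that category, which is why the encoded columns come out as float64. This is a simplification and not the algorithm category_encoders uses (TargetEncoder also applies smoothing); the function names and toy data are hypothetical.

```python
import pandas as pd


def fit_mean_encoding(column: pd.Series, target: pd.Series):
    """Learn per-category target means plus a global fallback mean."""
    category_means = target.groupby(column).mean()
    return category_means, target.mean()


def apply_mean_encoding(column: pd.Series, category_means: pd.Series, fallback: float) -> pd.Series:
    """Replace each category with its learned mean; unseen categories get the fallback."""
    return column.map(category_means).fillna(fallback).astype("float64")


# Toy data (hypothetical); the real dataset is not shared in the thread.
train_col = pd.Series(["a", "b", "a", "c"], dtype=object, name="store_primary_category")
train_y = pd.Series([10.0, 20.0, 30.0, 40.0], name="duration")

# Fit on the training rows only, then apply to validation rows.
means, global_mean = fit_mean_encoding(train_col, train_y)
val_col = pd.Series(["a", "c", "z"], dtype=object, name="store_primary_category")
encoded = apply_mean_encoding(val_col, means, global_mean)

print(encoded.dtype)
```

The encoded column no longer has object dtype, so a later dtype complaint about the same column names points at some other object that still carries the raw values.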

I have tried the following workarounds, which all result in the same error:

I will share my environment and versions, code snippets, and the full traceback below. I don't think I'm allowed to share the dataset, so the full notebook may not add much. Please let me know if I can help further.

Environment info

Jupyter versions:

Relevant code & traceback

I'm including only the code snippets that I think are relevant from my notebook. Basically, the steps are:

# Read data
df = pd.read_csv("./InputData/full_data.csv")
X = df.drop(["created_at", "actual_delivery_time", "duration"], axis = 1)
X.dtypes

market_id                                       object
store_id                                        object
store_primary_category                          object
order_protocol                                  object
total_items                                     int64
subtotal                                        int64
num_distinct_items                              int64
min_item_price                                  int64
max_item_price                                  int64
total_onshift_dashers                           int64
total_busy_dashers                              int64
total_outstanding_orders                        int64
estimated_order_place_duration                  int64
estimated_store_to_consumer_driving_duration    int64
weekday_0                                       int64
weekday_1                                       int64
weekday_2                                       int64
weekday_3                                       int64
weekday_4                                       int64
weekday_5                                       int64
hour_sin                                        float64
hour_cos                                        float64
minute_sin                                      float64
minute_cos                                      float64
superbowl                                       int64
valentines                                      int64
total_available_dashers                         int64
ratio_busy_dashers                              float64
busy_score                                      float64
dtype: object

# Train - val - test split
X_train, X_val, X_test = X[:train_end], X[train_end:val_end], X[val_end:]
y_train, y_val, y_test = y[:train_end], X[train_end:val_end], X[val_end:]

# Create target encoders

# store_id encoder with hierarchy, top level market_id
hierarchy = pd.DataFrame(X["market_id"]).rename({"market_id": "HIER_store_id_1"}, axis = 1)
encoder_storeid = TargetEncoder(cols = ["store_id"], hierarchy = hierarchy)

# Encoder for remaining categoricals, without hierarchy
encoder = TargetEncoder(cols = ["market_id", "store_primary_category", "order_protocol"])

pipeline = Pipeline([
    ("encoder_storeid", encoder_storeid),
    ("encoder", encoder)
])

# Preprocess data
X_train = pipeline.fit_transform(X_train, y_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

X_train.dtypes

market_id                                       float64
store_id                                        float64
store_primary_category                          float64
order_protocol                                  float64
total_items                                     int64
subtotal                                        int64
num_distinct_items                              int64
min_item_price                                  int64
max_item_price                                  int64
total_onshift_dashers                           int64
total_busy_dashers                              int64
total_outstanding_orders                        int64
estimated_order_place_duration                  int64
estimated_store_to_consumer_driving_duration    int64
weekday_0                                       int64
weekday_1                                       int64
weekday_2                                       int64
weekday_3                                       int64
weekday_4                                       int64
weekday_5                                       int64
hour_sin                                        float64
hour_cos                                        float64
minute_sin                                      float64
minute_cos                                      float64
superbowl                                       int64
valentines                                      int64
total_available_dashers                         int64
ratio_busy_dashers                              float64
busy_score                                      float64
dtype: object

Below is the code snippet & traceback for creating & fitting the model. It is part of an Optuna objective function, but I'm omitting the Optuna code as I don't think it's relevant.

# Create model
    callback_pruner = [optuna.integration.XGBoostPruningCallback(
        trial, "validation_0-mean_squared_error")]

    model = XGBRegressor(
        device = "cuda",
        objective = "reg:squarederror",
        callbacks = callback_pruner,
        verbosity = 0,
        random_state = random_state,
        n_estimators = 5000,
        early_stopping_rounds = 50,
        eval_metric = mean_squared_error,
        max_depth = max_depth,
        learning_rate = learning_rate,
        min_child_weight = min_child_weight,
        gamma = gamma,
        reg_alpha = reg_alpha,
        reg_lambda = reg_lambda,
        subsample = subsample,
        colsample_bytree = colsample_bytree
    )

    # Train model with early stopping

    model.fit(
        X = X_train, 
        y = y_train, 
        eval_set = [(X_val, y_val)], 
        verbose = False)

ValueError Traceback (most recent call last)
Cell In[17], line 2
      1 # Perform study
----> 2 study_xgb.optimize(
      3     objective_xgb,
      4     n_trials = 1000,
      5     show_progress_bar = True)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\optuna\study\study.py:451, in Study.optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
    348 def optimize(
    349     self,
    350     func: ObjectiveFuncType,
   (...)
    357     show_progress_bar: bool = False,
    358 ) -> None:
    359     """Optimize an objective function.
    360
    361     Optimization is done by choosing a suitable set of hyperparameter values from a given
   (...)
    449             If nested invocation of this method occurs.
    450     """
--> 451     _optimize(
    452         study=self,
    453         func=func,
    454         n_trials=n_trials,
    455         timeout=timeout,
    456         n_jobs=n_jobs,
    457         catch=tuple(catch) if isinstance(catch, Iterable) else (catch,),
    458         callbacks=callbacks,
    459         gc_after_trial=gc_after_trial,
    460         show_progress_bar=show_progress_bar,
    461     )

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\optuna\study\_optimize.py:66, in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
     64 try:
     65     if n_jobs == 1:
---> 66         _optimize_sequential(
     67             study,
     68             func,
     69             n_trials,
     70             timeout,
     71             catch,
     72             callbacks,
     73             gc_after_trial,
     74             reseed_sampler_rng=False,
     75             time_start=None,
     76             progress_bar=progress_bar,
     77         )
     78     else:
     79         if n_jobs == -1:

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\optuna\study\_optimize.py:163, in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
    160         break
    162 try:
--> 163     frozen_trial = _run_trial(study, func, catch)
    164 finally:
    165     # The following line mitigates memory problems that can be occurred in some
    166     # environments (e.g., services that use computing containers such as GitHub Actions).
    167     # Please refer to the following PR for further details:
    168     # https://github.com/optuna/optuna/pull/325.
    169     if gc_after_trial:

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\optuna\study\_optimize.py:251, in _run_trial(study, func, catch)
    244     assert False, "Should not reach."
    246 if (
    247     frozen_trial.state == TrialState.FAIL
    248     and func_err is not None
    249     and not isinstance(func_err, catch)
    250 ):
--> 251     raise func_err
    252 return frozen_trial

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\optuna\study\_optimize.py:200, in _run_trial(study, func, catch)
    198 with get_heartbeat_thread(trial._trial_id, study._storage):
    199     try:
--> 200         value_or_values = func(trial)
    201     except exceptions.TrialPruned as e:
    202         # TODO(mamu): Handle multi-objective cases.
    203         state = TrialState.PRUNED

Cell In[15], line 39, in objective_xgb(trial)
     18 model = XGBRegressor(
     19     device = "cuda",
     20     objective = "reg:squarederror",
   (...)
     34     colsample_bytree = colsample_bytree
     35 )
     37 # Train model with early stopping
---> 39 model.fit(
     40     X = X_train,
     41     y = y_train,
     42     eval_set = [(X_val, y_val)],
     43     verbose = False)
     45 # Report best number of rounds
     46 trial.set_user_attr("n_rounds", (model.best_iteration + 1))

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\sklearn.py:1051, in XGBModel.fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1049 with config_context(verbosity=self.verbosity):
   1050     evals_result: TrainingCallback.EvalsLog = {}
-> 1051     train_dmatrix, evals = _wrap_evaluation_matrices(
   1052         missing=self.missing,
   1053         X=X,
   1054         y=y,
   1055         group=None,
   1056         qid=None,
   1057         sample_weight=sample_weight,
   1058         base_margin=base_margin,
   1059         feature_weights=feature_weights,
   1060         eval_set=eval_set,
   1061         sample_weight_eval_set=sample_weight_eval_set,
   1062         base_margin_eval_set=base_margin_eval_set,
   1063         eval_group=None,
   1064         eval_qid=None,
   1065         create_dmatrix=self._create_dmatrix,
   1066         enable_categorical=self.enable_categorical,
   1067         feature_types=self.feature_types,
   1068     )
   1069     params = self.get_xgb_params()
   1071     if callable(self.objective):

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\sklearn.py:585, in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, enable_categorical, feature_types)
    583     evals.append(train_dmatrix)
    584 else:
--> 585     m = create_dmatrix(
    586         data=valid_X,
    587         label=valid_y,
    588         weight=sample_weight_eval_set[i],
    589         group=eval_group[i],
    590         qid=eval_qid[i],
    591         base_margin=base_margin_eval_set[i],
    592         missing=missing,
    593         enable_categorical=enable_categorical,
    594         feature_types=feature_types,
    595         ref=train_dmatrix,
    596     )
    597     evals.append(m)
    598 nevals = len(evals)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\sklearn.py:954, in XGBModel._create_dmatrix(self, ref, **kwargs)
    952 if _can_use_qdm(self.tree_method) and self.booster != "gblinear":
    953     try:
--> 954         return QuantileDMatrix(
    955             **kwargs, ref=ref, nthread=self.n_jobs, max_bin=self.max_bin
    956         )
    957     except TypeError:  # QuantileDMatrix supports lesser types than DMatrix
    958         pass

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:1528, in QuantileDMatrix.__init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, max_bin, ref, group, qid, label_lower_bound, label_upper_bound, feature_weights, enable_categorical, data_split_mode)
   1508 if any(
   1509     info is not None
   1510     for info in (
   (...)
   1521     )
   1522 ):
   1523     raise ValueError(
   1524         "If data iterator is used as input, data like label should be "
   1525         "specified as batch argument."
   1526     )
-> 1528 self._init(
   1529     data,
   1530     ref=ref,
   1531     label=label,
   1532     weight=weight,
   1533     base_margin=base_margin,
   1534     group=group,
   1535     qid=qid,
   1536     label_lower_bound=label_lower_bound,
   1537     label_upper_bound=label_upper_bound,
   1538     feature_weights=feature_weights,
   1539     feature_names=feature_names,
   1540     feature_types=feature_types,
   1541     enable_categorical=enable_categorical,
   1542 )

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:1587, in QuantileDMatrix._init(self, data, ref, enable_categorical, **meta)
   1575 config = make_jcargs(
   1576     nthread=self.nthread, missing=self.missing, max_bin=self.max_bin
   1577 )
   1578 ret = _LIB.XGQuantileDMatrixCreateFromCallback(
   1579     None,
   1580     it.proxy.handle,
   (...)
   1585     ctypes.byref(handle),
   1586 )
-> 1587 it.reraise()
   1588 # delay check_call to throw intermediate exception first
   1589 _check_call(ret)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:575, in DataIter.reraise(self)
    573 exc = self._exception
    574 self._exception = None
--> 575 raise exc

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:556, in DataIter._handle_exception(self, fn, dft_ret)
    553     return dft_ret
    555 try:
--> 556     return fn()
    557 except Exception as e:  # pylint: disable=broad-except
    558     # Defer the exception in order to return 0 and stop the iteration.
    559     # Exception inside a ctype callback function has no effect except
    560     # for printing to stderr (doesn't stop the execution).
    561     tb = sys.exc_info()[2]

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:640, in DataIter._next_wrapper.<locals>.<lambda>()
    637     self._data_ref = ref
    639 # pylint: disable=not-callable
--> 640 return self._handle_exception(lambda: self.next(input_data), 0)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\data.py:1280, in SingleBatchInternalIter.next(self, input_data)
   1278     return 0
   1279 self.it += 1
-> 1280 input_data(**self.kwargs)
   1281 return 1

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:632, in DataIter._next_wrapper.<locals>.input_data(data, feature_names, feature_types, **kwargs)
    630 self._temporary_data = (new, cat_codes, feature_names, feature_types)
    631 dispatch_proxy_set_data(self.proxy, new, cat_codes, self._allow_host)
--> 632 self.proxy.set_info(
    633     feature_names=feature_names,
    634     feature_types=feature_types,
    635     **kwargs,
    636 )
    637 self._data_ref = ref

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:729, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    727 for k, arg in zip(sig.parameters, args):
    728     kwargs[k] = arg
--> 729 return func(**kwargs)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:931, in DMatrix.set_info(self, label, weight, base_margin, group, qid, label_lower_bound, label_upper_bound, feature_names, feature_types, feature_weights)
    928 from .data import dispatch_meta_backend
    930 if label is not None:
--> 931     self.set_label(label)
    932 if weight is not None:
    933     self.set_weight(weight)

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\core.py:1069, in DMatrix.set_label(self, label)
   1060 """Set label of dmatrix
   1061
   1062 Parameters
   (...)
   1065     The label information to be set into DMatrix
   1066 """
   1067 from .data import dispatch_meta_backend
-> 1069 dispatch_meta_backend(self, label, "label", "float")

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\data.py:1221, in dispatch_meta_backend(matrix, data, name, dtype)
   1219     return
   1220 if _is_pandasdf(data):
-> 1221     data, _, _ = _transform_pandas_df(data, False, meta=name, meta_type=dtype)
   1222     _meta_from_numpy(data, name, dtype, handle)
   1223     return

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\data.py:490, in _transform_pandas_df(data, enable_categorical, feature_names, feature_types, meta, meta_type)
    483 for dtype in data.dtypes:
    484     if not (
    485         (dtype.name in _pandas_dtype_mapper)
    486         or is_pd_sparse_dtype(dtype)
    487         or (is_pd_cat_dtype(dtype) and enable_categorical)
    488         or is_pa_ext_dtype(dtype)
    489     ):
--> 490         _invalid_dataframe_dtype(data)
    491     if is_pa_ext_dtype(dtype):
    492         pyarrow_extension = True

File ~\Documents\WorkLocal\DataScience\GitHub\MixedEffectsRegressionDeliveryTimes\venv\lib\site-packages\xgboost\data.py:308, in _invalid_dataframe_dtype(data)
    306 type_err = "DataFrame.dtypes for data must be int, float, bool or category."
    307 msg = f"""{type_err} {_ENABLE_CAT_ERR} {err}"""
--> 308 raise ValueError(msg)

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameterenable_categorical must be set to True. Invalid columns:market_id: object, store_id: object, store_primary_category: object, order_protocol: object

trivialfis commented 8 months ago

Hi, I'm not sure how I can help with this error. XGBoost simply checks DataFrame.dtypes as it is; if other libraries or procedures are creating invalid types, there's not much we can do inside XGBoost.
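A check of this kind can be approximated outside XGBoost to see which columns of a given frame would be rejected. The sketch below is a simplified pandas-only approximation, not XGBoost's actual validation code; the helper name `invalid_columns` and the toy frame are hypothetical, and it treats category columns as acceptable even though XGBoost additionally requires enable_categorical for them.

```python
import pandas as pd


def invalid_columns(df: pd.DataFrame) -> dict:
    """Return {column: dtype name} for columns that are not int, float, bool, or category."""
    bad = {}
    for name, dtype in df.dtypes.items():
        accepted = (
            pd.api.types.is_bool_dtype(dtype)
            or pd.api.types.is_numeric_dtype(dtype)
            or isinstance(dtype, pd.CategoricalDtype)
        )
        if not accepted:
            bad[name] = dtype.name
    return bad


# Toy frame (hypothetical): one raw object column next to numeric ones.
frame = pd.DataFrame({
    "market_id": pd.Series(["1.0", "2.0"], dtype=object),
    "total_items": [4, 1],
    "hour_sin": [0.5, -0.5],
})

print(invalid_columns(frame))
```

Running the helper on every object handed to fit(), including each element of eval_set, shows which one still carries non-numeric columns.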

AhmetZamanis commented 8 months ago

I see, but the datatypes in DataFrame.dtypes seem to be correct after the transformation, so I thought the issue may arise from XGBoost. I've also used older versions of XGBoost with data transformed by category_encoders without issue.

I'll keep tinkering and post an update if I find out more.

AhmetZamanis commented 8 months ago

Very sorry to waste your time. There is no issue, and everything works as expected. I made a very silly typo in my data-splitting code and realized it very late.

Feel free to delete this thread if possible; I couldn't figure out how.

Gandharv29 commented 7 months ago

Can you please share exactly what mistake you made? I am also facing the same issue.

AhmetZamanis commented 7 months ago

@Gandharv29 I made an error in splitting the features and the target:

y_train, y_val, y_test = y[:train_end], X[train_end:val_end], X[val_end:]

Because y_val and y_test were sliced from X instead of y, I was unknowingly passing the unprocessed features as the validation target, and correctly getting the datatype error. I doubt you have the same issue.
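For anyone landing here with the same error, here is a minimal sketch of the corrected split with a guard that catches this kind of mix-up before fit() is called. The toy data, the split points, and the `check_target` helper are hypothetical; only the slicing pattern comes from the thread.

```python
import pandas as pd

# Toy stand-in data (hypothetical); the real dataset is not shared in the thread.
X = pd.DataFrame({
    "market_id": pd.Series(["1.0", "2.0", "1.0", "3.0", "2.0", "1.0"], dtype=object),
    "total_items": [4, 1, 2, 6, 3, 5],
})
y = pd.Series([30.5, 41.0, 28.2, 55.1, 37.9, 33.3], name="duration")
train_end, val_end = 3, 5  # hypothetical split points

# Features are sliced from X, targets from y.
X_train, X_val, X_test = X[:train_end], X[train_end:val_end], X[val_end:]
y_train, y_val, y_test = y[:train_end], y[train_end:val_end], y[val_end:]


def check_target(name: str, target, features: pd.DataFrame) -> None:
    """Fail early if a target is not a numeric Series aligned with its features."""
    if not isinstance(target, pd.Series):
        raise TypeError(f"{name} should be a Series, got {type(target).__name__}")
    if not pd.api.types.is_numeric_dtype(target):
        raise TypeError(f"{name} has non-numeric dtype {target.dtype}")
    if len(target) != len(features):
        raise ValueError(f"{name} has {len(target)} rows, features have {len(features)}")


for name, target, features in [
    ("y_train", y_train, X_train),
    ("y_val", y_val, X_val),
    ("y_test", y_test, X_test),
]:
    check_target(name, target, features)
```

With the original typo, y_val would be a DataFrame slice of X, and the guard would raise a TypeError naming y_val instead of the error surfacing deep inside DMatrix construction.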