dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

DaskXGBRegressor with tweedie objective throws error when N-chunks = 1 #10493

Open lewis-anderson53 opened 2 months ago

lewis-anderson53 commented 2 months ago

I've been trying to use Dask to chunk data and train a regressor, but I kept receiving this error when trying out some sample code:

xgboost.core.XGBoostError: [15:26:58] /Users/runner/work/xgboost/xgboost/src/objective/regression_obj.cu:528: Check failed: info.labels.Size() != 0U (0 vs. 0) : label set cannot be empty

After some experimentation I realised that when I changed the chunks in my Dask array to a size smaller than the full array length, the error went away.

I've included a snippet which should help recreate the issue:

import distributed
from xgboost import dask as dxgb
import dask.array as da

if __name__ == "__main__":
    client = distributed.Client()
    samples = 1000
    cols = 10

    X_train = da.random.random(size=(samples, cols), chunks=1000)  # set chunks < samples if you want the training to succeed
    y_train = X_train.sum(axis=1)

    X_val = da.random.random(size=(samples, cols), chunks=1000)
    y_val = X_val.sum(axis=1)

    model = dxgb.DaskXGBRegressor(
        learning_rate=0.1,
        max_depth=3,
        early_stopping_rounds=30,
        objective='reg:tweedie',
    )

    model.fit(
        X_train,
        y_train,
        eval_set=[(X_val, y_val)],
    )

    print(model.feature_importances_)

There are a few ways I can get this to succeed, e.g. changing my objective to reg:squarederror, which seems happy when chunk size = data size. I don't understand why the tweedie objective doesn't work, though.

Environment details: OS: MacOS 14.5 Chip: M1 Max Python Version: 3.11 Packages:

trivialfis commented 2 months ago

Hmm, sometimes Dask generates empty partitions/workers; this is worked around case by case in the objectives. The tweedie objective doesn't have that workaround.
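
For reference, the workaround the reporter found (using chunks smaller than the full array) can be applied with `rechunk` so that every worker in a multi-worker cluster receives a non-empty partition of labels. This is a minimal sketch of that chunking change, assuming the same 1000x10 shape as the snippet above; the chunk count of 4 is an arbitrary illustration, not a recommendation:

```python
import dask.array as da

samples, cols = 1000, 10

# One chunk spanning the whole array: only a single worker can hold
# the data, so other workers may see empty label partitions, which
# trips the "label set cannot be empty" check in reg:tweedie.
single = da.random.random(size=(samples, cols), chunks=(samples, cols))

# Rechunk into several blocks along axis 0 so partitions can be
# spread across workers without leaving any of them empty.
multi = single.rechunk((samples // 4, cols))

# The rechunked array has multiple partitions; labels derived from it
# (e.g. multi.sum(axis=1)) inherit the same chunking along axis 0.
```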