dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License
162 stars 43 forks source link

"The truth value of a Series is ambiguous" exception when training with `sample_weights`. #81

Open KWiecko opened 3 years ago

KWiecko commented 3 years ago

What happened:

When sample_weight is specified and train() method input's npartitions is not equal to the number of workers the following evaluation:

sample_weight = concat(sample_weight) if np.all(sample_weight) else None

fails :(

What you expected to happen: Train model without exceptions :)

Minimal Complete Verifiable Example:

The following code raises an exception:

"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

The bug is located in 102nd line of dask_xgboost/core.py file (https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py)

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import dask_xgboost as dxgb
import pandas as pd

if __name__ == '__main__':
    lc = LocalCluster(n_workers=2, threads_per_worker=1)
    c = Client(lc)

    npartitions = 4

    X = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1], 'w': [1, 2, 3, 4]})
    X_dd = dd.from_pandas(X, npartitions=npartitions)
    print(X_dd.columns)

    y = pd.Series([0, 0, 1, 1])
    y_dd = dd.from_pandas(y, npartitions=npartitions)

    w = X_dd['w']

    params = {
        "tree_method": "hist",
        'objective': 'binary:logistic'}

    bst = dxgb.train(
        client=c, params=params, data=X_dd, labels=y_dd, sample_weight=w)

The following chunk proposes a change which should fix this

# This is a change proposal
sample_weight = \
    concat(sample_weight) \
        if ~np.any([np.isnan(w).sum() for w in sample_weight]) \
        else None

# This is the old version and is bugged
# sample_weight = concat(sample_weight) if np.all(sample_weight) else None

One notable thing is that

np.all(pd.Series([0, 1, 2]))

will evaluate to False due to 0 in Series. Is that expected (is a weight for a single observation allowed to be 0) ? If such value for an atomic weight is allowed this condition will make dask-xgboost skip proper weights containing 0s.

Both conditions will fail silently when one of atomic weights is a NaN. I suppose when sample_weight is not None but weight for one of the observations is NaN the code should raise an Exception, e.g.:


if isinstance(sample_weight, tuple):
    if np.any([np.isnan(w).sum() for w in sample_weight if w is not None]):
        raise ValueError('Provided `sample_weights` contain NaNs')
else:
    if sample_weight is not None \
            and np.any([np.isnan(w).sum() for w in sample_weight]):
        raise ValueError('Provided `sample_weights` contain NaNs')

This was a tricky one to find.

Anything else we need to know?:

Environment:

jakirkham commented 3 years ago

Would suggest looking at XGBoost's own Dask integration over using this library (if it is an option for you)

https://xgboost.readthedocs.io/en/latest/tutorials/dask.html

KWiecko commented 3 years ago

Unfortunately I am bound to xgboost==0.90. I tried to find the dask API buried inside xgboost==0.90 both in the docs (https://xgboost.readthedocs.io/en/release_0.90/index.html) and across the web and the only solution that I was able to find was dask-xgboost + xgboost.