dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

`AttributeError` with fitting model on Dask Array backed by `scipy.sparse.csr_matrix` #7454

Closed by jrbourbeau 2 years ago

jrbourbeau commented 2 years ago

I came across a use case where fitting a DaskXGBClassifier on a Dask Array whose partitions are scipy.sparse.csr_matrix objects (as returned by Dask-ML's HashingVectorizer) fails with `AttributeError: divisions not found` (full traceback included below).

From some initial debugging, it appears the underlying issue is that during fitting we end up passing a list of sparse matrices to Dask's dd.multi.concat here:

https://github.com/dmlc/xgboost/blob/d33854af1b4f783c5230bb21aff7234b16f409f7/python-package/xgboost/dask.py#L207

However, dd.multi.concat expects a list of Dask DataFrames, which is where the AttributeError: divisions not found is coming from (Dask DataFrames have a .divisions attribute which dd.multi.concat assumes exists).
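To make the mismatch concrete, here is a minimal sketch of a concat helper that picks the stacking routine from the partition type instead of assuming a DataFrame. `concat_parts` is a hypothetical name used only for illustration; this is not xgboost's code and not necessarily how the issue was fixed. It assumes the partitions are already in memory and are either NumPy arrays or SciPy sparse matrices.

```python
import numpy as np
import scipy.sparse as sp


def concat_parts(parts):
    """Stack in-memory partitions row-wise, choosing the routine by type.

    Hypothetical helper for illustration; not xgboost's actual code.
    """
    parts = list(parts)
    if not parts:
        raise ValueError("need at least one partition")
    first = parts[0]
    if isinstance(first, np.ndarray):
        return np.concatenate(parts, axis=0)
    if sp.issparse(first):
        # Sparse matrices have no `.divisions`, so they must not fall
        # through to a DataFrame-only concat routine.
        return sp.vstack(parts, format="csr")
    raise TypeError(f"unsupported partition type: {type(first)!r}")


parts = [sp.csr_matrix(np.eye(2)), sp.csr_matrix(np.ones((3, 2)))]
stacked = concat_parts(parts)
print(hasattr(parts[0], "divisions"))  # False
print(stacked.shape)  # (5, 2)
```

The `hasattr` check shows why a DataFrame-oriented concat cannot handle these partitions: the attribute it relies on does not exist on a sparse matrix.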

Here's an example code snippet which should reproduce the issue when using the latest xgboost (1.5.0) and dask (2021.11.2) / distributed (2021.11.2) releases:

```python
import dask.dataframe as dd
import dask_ml.feature_extraction.text
import pandas as pd
import sklearn.datasets
from dask.distributed import Client
from xgboost.dask import DaskXGBClassifier

if __name__ == "__main__":

    with Client():
        # Create Dask DataFrame from sklearn 20newsgroups dataset
        bunch = sklearn.datasets.fetch_20newsgroups()
        df = dd.from_pandas(
            pd.DataFrame({"text": bunch.data, "target": bunch.target}), npartitions=25
        )

        # Create features with dask-ml's `HashingVectorizer`
        vect = dask_ml.feature_extraction.text.HashingVectorizer()
        X = vect.fit_transform(df["text"])

        # Format classification labels
        y = df["target"].to_dask_array()

        # Train XGBoost classifier
        clf = DaskXGBClassifier()
        print(f"{X = }")
        print(f"{y = }")
        clf.fit(X, y)  # Results in `AttributeError: divisions not found`
```

Full traceback:

```
Traceback (most recent call last):
  File "/Users/james/projects/coiled/evangelism-private/mongodb-with-coiled/test.py", line 28, in <module>
    clf.fit(X, y)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1817, in fit
    return self._client_sync(self._fit_async, **args)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1623, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=asynchronous)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 865, in sync
    return sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 327, in sync
    raise exc.with_traceback(tb)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 310, in f
    result[0] = yield future
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1775, in _fit_async
    results = await self.client.sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 931, in _train_async
    results = await client.gather(futures, asynchronous=True)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 1842, in _gather
    raise exception.with_traceback(traceback)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 867, in dispatched_train
    local_dtrain = _dmatrix_from_list_of_parts(**dtrain_ref)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 800, in _dmatrix_from_list_of_parts
    return _create_dmatrix(**kwargs)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 774, in _create_dmatrix
    _data = concat(data)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 206, in concat
    return dd.multi.concat(list(value), axis=0)
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1237, in concat
    if all(
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1238, in <genexpr>
    dfs[i].divisions[-1] < dfs[i + 1].divisions[0]
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/scipy/sparse/base.py", line 687, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: divisions not found
```

trivialfis commented 2 years ago

Thank you for opening the issue. I will work on some tests for sparse and scipy.sparse with Dask.

avriiil commented 2 years ago

I'm encountering the same issue as @jrbourbeau with the following package versions: xgboost 1.5.1, dask 2022.02.0, distributed 2022.02.0.

The example code snippet above returns the same error: "AttributeError: divisions not found"

@trivialfis -- were your changes merged into 1.5.1?

avriiil commented 2 years ago

@trivialfis - any update on this? I am still encountering this issue while running xgboost 1.5.1

trivialfis commented 2 years ago

@rrpelgrim Please update to the latest XGBoost release, 1.6.1.