dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Add support for using data with `pyarrow`-backed `pandas` extension dtypes #8598

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

In pandas=1.5, pandas added support for pyarrow-backed extension dtypes. Using these dtypes (in particular string[pyarrow]) can lead to large improvements in both memory usage and computation wall time.
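
For a rough sense of the memory side (an illustrative check, not part of the original report; actual numbers depend on the data and library versions):

import pandas as pd

s_object = pd.Series(["alice", "bob", "rick"] * 100_000)   # default object-backed strings
s_arrow = s_object.astype("string[pyarrow]")               # pyarrow-backed strings
print(s_object.memory_usage(deep=True))                    # noticeably larger
print(s_arrow.memory_usage(deep=True))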

I went to use these new dtypes with xgboost and got a (very informative) error about them not being supported. Here's a minimal reproducer:

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({"name": ["alice", "bob", "rick"], "x": range(3), "y": [1.3, 7.2, 0.6]})
df = df.astype(
    {
        "name": "string[pyarrow]",
        "x": "int64[pyarrow]",
        "y": "float64[pyarrow]",
    }
)

X = df.drop(columns=["name"])
y = df.loc[:, "name"]
dtrain = xgb.DMatrix(X, y)
output = xgb.train(
    {"verbosity": 2, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)
print(f"{output = }")

which outputs

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/xgboost-pyarrow.py", line 15, in <module>
    dtrain = xgb.DMatrix(X, y)
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/core.py", line 743, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/data.py", line 957, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/data.py", line 404, in _from_pandas_df
    data, feature_names, feature_types = _transform_pandas_df(
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/data.py", line 378, in _transform_pandas_df
    _invalid_dataframe_dtype(data)
  File "/Users/james/mambaforge/envs/dask-py39/lib/python3.9/site-packages/xgboost/data.py", line 270, in _invalid_dataframe_dtype
    raise ValueError(msg)
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, The experimental DMatrix parameter`enable_categorical` must be set to `True`.  Invalid columns:x: int64[pyarrow], y: double[pyarrow]

It looks like support for pandas nullable extension dtypes (e.g. Int64, Float64) has already been added to xgboost (xref https://github.com/dmlc/xgboost/pull/7760, https://github.com/dmlc/xgboost/pull/8480), and it would be great if pyarrow-backed extension dtypes were also supported.
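
A possible workaround in the meantime (just a sketch against the reproducer above, and it gives up the pyarrow memory benefits) is to cast the arrow-backed columns back to numpy dtypes before building the DMatrix:

import xgboost as xgb

# Reusing `df` from the reproducer above; the dtype mapping is specific to that example.
X_np = df.drop(columns=["name"]).astype({"x": "int64", "y": "float64"})
dtrain = xgb.DMatrix(X_np)  # a label can be attached the same way once it is numeric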

trivialfis commented 1 year ago

Hi @jrbourbeau, thank you for raising the issue. A quick search didn't turn up any relevant documentation for enumerating these types. Could you please share some references I can use?

https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#ecosystem-extensions
https://pandas.pydata.org/pandas-docs/stable/development/extending.html#compatibility-with-apache-arrow
https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes

jrbourbeau commented 1 year ago

Thanks @trivialfis! Yeah, I'm under the impression that extensive documentation around the new pyarrow dtypes is still in development. IIUC, any pyarrow data type (listed here: https://arrow.apache.org/docs/python/api/datatypes.html) can now be converted to a pandas dtype through pd.ArrowDtype.

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pd.ArrowDtype(pa.uint8())
Out[3]: uint8[pyarrow]

Also cc @mroeschke, who is the expert on the pandas side.

mroeschke commented 1 year ago

Sorry this is not nicely documented as of the pandas 1.5 release. The public API (namely pd.ArrowDtype and pd.arrays.ArrowExtensionArray) can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#pyarrow
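
As a rough sketch (assuming pandas>=1.5 with pyarrow installed), these columns can be recognized with just that public API, for example:

import pandas as pd
import pyarrow as pa

def is_pyarrow_backed(dtype) -> bool:
    # pd.ArrowDtype covers e.g. int64[pyarrow]; StringDtype("pyarrow") covers string[pyarrow]
    return isinstance(dtype, pd.ArrowDtype) or (
        isinstance(dtype, pd.StringDtype) and dtype.storage == "pyarrow"
    )

df = pd.DataFrame({"x": range(3)}).astype({"x": "int64[pyarrow]"})
dtype = df["x"].dtype
print(is_pyarrow_backed(dtype))                  # True
print(dtype.pyarrow_dtype)                       # the underlying pyarrow.DataType (int64)
print(pa.types.is_integer(dtype.pyarrow_dtype))  # True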

Feel free to ping me with any further questions!

trivialfis commented 1 year ago

I gained some basic understanding of Arrow recently through work on CUDA IPC. I think XGBoost needs a new interface with pyarrow. Will try to build a prototype later.
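
As a rough starting point (not the final design), a pyarrow-backed DataFrame can already be handed over as a pyarrow.Table with the Arrow types intact, something like:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": range(3), "y": [1.3, 7.2, 0.6]}).astype(
    {"x": "int64[pyarrow]", "y": "float64[pyarrow]"}
)

table = pa.Table.from_pandas(df)      # Arrow types are preserved via __arrow_array__
print(table.schema)                   # x: int64, y: double (plus pandas metadata)
print(table.column("x").to_pylist())  # [0, 1, 2]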

trivialfis commented 1 year ago

Hey, out of curiosity, is this going to be used in dask in the near future? It would be really exciting!

jrbourbeau commented 1 year ago

I think XGBoost needs a new interface with pyarrow. Will try to build a prototype later.

Woo! Feel free to ping me on anything if you think it'd be useful.

Hey, out of curiosity, is this going to be used in dask in the near future? It would be really exciting!

Yes! We're currently working on improving Dask's support for pyarrow-backed dtypes. FWIW I actually ran across this issue originally using xgboost.dask and a Dask DataFrame with int64[pyarrow], string[pyarrow], etc. data. If you have some examples of where pyarrow-backed dtypes have been / would be useful to you, @mroeschke and I would love to hear about them : )

trivialfis commented 1 year ago

Hi, I opened a PR: https://github.com/dmlc/xgboost/pull/8653. It's not complete yet since boolean and dictionary types are not supported. Please take a look when you have some bandwidth (it would be great if you have ideas for improving the pandas handling code; it's becoming more and more complicated).
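
For reference, the two cases the PR doesn't handle yet correspond roughly to dtypes like these (illustrative only):

import pandas as pd
import pyarrow as pa

bool_dtype = pd.ArrowDtype(pa.bool_())                              # bool[pyarrow]
dict_dtype = pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))  # dictionary-encoded strings
print(bool_dtype)
print(dict_dtype)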