Open jrbourbeau opened 1 year ago
Hi @jrbourbeau, thank you for raising the issue. A quick search didn't return any relevant documentation that enumerates the types. Could you please share some references I can use?
https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#ecosystem-extensions https://pandas.pydata.org/pandas-docs/stable/development/extending.html#compatibility-with-apache-arrow https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes
Thanks @trivialfis! Yeah, I'm under the impression that extensive documentation around the new pyarrow dtypes is still in development. IIUC any pyarrow data type (listed here https://arrow.apache.org/docs/python/api/datatypes.html) can now be converted to a pandas dtype through pd.ArrowDtype.
In [1]: import pandas as pd
In [2]: import pyarrow as pa
In [3]: pd.ArrowDtype(pa.uint8())
Out[3]: uint8[pyarrow]
Also cc @mroeschke who is the expert on the pandas side
Sorry this is not nicely documented as of the pandas 1.5 release. The public API (namely pd.ArrowDtype and pd.arrays.ArrowExtensionArray) can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#pyarrow
Feel free to ping me with any further questions!
I gained some basic understanding of arrow recently due to work on CUDA IPC. I think XGBoost needs a new interface with pyarrow. Will try to build a prototype later.
Hey, out of curiosity, is this going to be used in dask in recent future? It would be really exciting!
I think XGBoost needs a new interface with pyarrow. Will try to build a prototype later.
Woo! Feel free to ping me into anything if you think it'd be useful
Hey, out of curiosity, is this going to be used in dask in recent future? It would be really exciting!
Yes! We're currently working on improving Dask's support for pyarrow-backed dtypes. FWIW I actually ran across this issue originally using xgboost.dask and a Dask DataFrame with int64[pyarrow], string[pyarrow], etc. data. If you have some examples of where pyarrow-backed dtypes have been / would be useful to you, @mroeschke and I would love to hear about them : )
Hi, I opened a PR: https://github.com/dmlc/xgboost/pull/8653. It's not complete yet, since boolean and dictionary types are not supported. Please take a look when you have some bandwidth (it would be great if you have ideas for improving the pandas handling code, which is becoming more and more complicated).
In pandas=1.5, pandas added support for using pyarrow-backed extension data dtypes. Using these data types (in particular string[pyarrow]) can lead to large performance improvements in terms of memory usage and computation wall time.
I went to use these new dtypes with xgboost and got a (very informative) error about them not being supported. Here's a minimal reproducer:
which outputs
It looks like support for pandas nullable extension dtypes (e.g. Int64, Float64, etc.) has already been added to xgboost (xref https://github.com/dmlc/xgboost/pull/7760, https://github.com/dmlc/xgboost/pull/8480) and it would be great if pyarrow-backed extension dtypes were also supported.