fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0

[FEATURE] Adopt pandas `ExtensionDType` #506

Closed goodwanghan closed 1 year ago

goodwanghan commented 1 year ago

Currently, we avoid relying directly on pandas dtypes because they are often misleading. For example, an int64 column is converted to a float or object column as soon as it contains a None. The whole of triad and fugue is built on this assumption (pandas dtypes are not reliable), so there is a lot of extra logic to handle the situation correctly and efficiently.
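To make the problem concrete, here is a minimal sketch of the upcast in plain pandas. The variable names are only illustrative and nothing here is taken from the triad or fugue code.

```python
import pandas as pd

# A clean integer column keeps the dtype we asked for.
ints = pd.Series([1, 2, 3], dtype="int64")

# A single missing value makes pandas infer float64 instead of int64.
with_missing = pd.Series([1, None, 3])

# Casting back to plain int64 is not possible while the missing value is there.
try:
    with_missing.astype("int64")
    cast_ok = True
except (ValueError, TypeError):
    cast_ok = False

print(ints.dtype, with_missing.dtype, cast_ok)
```

So the dtype reported by pandas depends on the data in the column, not only on the intended schema, which is why it cannot be trusted as-is.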

Pandas ExtensionDtype has existed for a long time. These types are nullable, and in the latest pandas they are referred to as numpy_nullable. Because they have been around for so long, the newer versions of the packages that Fugue uses all support them well. So it is time to switch to ExtensionDtype to simplify the internal logic, which could also make certain features faster.
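As a rough illustration of what the nullable types buy us, the sketch below uses the pandas "Int64" extension dtype. It only shows the pandas behavior and is not how Fugue will necessarily do the conversion.

```python
import pandas as pd

# "Int64" (capital I) is a nullable extension dtype: the missing value is
# stored as pd.NA and the remaining values stay integers.
nullable = pd.Series([1, None, 3], dtype="Int64")
is_extension = isinstance(nullable.dtype, pd.api.extensions.ExtensionDtype)

# A column that was already upcast to float64 can be cast back, as long as
# every non-missing value is a whole number.
recovered = pd.Series([1.0, None, 3.0]).astype("Int64")

print(nullable.dtype, is_extension, nullable.isna().tolist())
print(recovered.dtype, recovered.isna().tolist())
```

With these dtypes the column type no longer changes when a None shows up, so the schema can be read from the dataframe directly.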

In pandas 2, ArrowDtype is introduced with the pyarrow backend. It is very new and not yet well supported by several major dependencies, so it is not time to use it. But if a pandas dataframe contains ArrowDtype columns, we should be able to convert them to safe types that all backends can accept.

This change will require significantly raising the lower bounds of some dependencies. For example, we can no longer support Python 3.7. PySpark has had bug-free support of ExtensionDtype only since 3.3, so we need an alternative solution for lower PySpark versions, at the cost of speed and complexity. The pyarrow lower bound has been changed to 6.0.1, but the best version (one that works correctly and efficiently without special handling) is 11+.