elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Pandas major version 2 support #601

Open bartbroere opened 11 months ago

bartbroere commented 11 months ago

Last April Pandas released version 2.0.0, which introduces many breaking changes. I have been submitting some pull requests here (#596 #595 #593 #592). These fix some minor things to prepare for supporting pandas>=2.0.0. None of the fixes so far break pandas==1.5.0 support.

However, there are also some issues that are a bit harder to fix for version 2 without perhaps breaking some of the previous functionality.

One such example is the fact that in aggregations such as groupby, pandas has ignored the sort parameter for a long time. Tests that compare the column order between eland and pandas will fail for either pandas 1.5.0 or pandas 2.0.0.

Is the Eland project planning a major release when starting to support pandas 2? Or will it support pandas 2 by implementing different behaviour based on runtime checks of pandas' version?
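To make the second option concrete, here is a minimal sketch of what a runtime check could look like. The helper names `major_version` and `uses_pandas_2` are hypothetical and not part of eland; the sketch only assumes that the version string starts with an integer major component, as `pd.__version__` does.

```python
def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '2.0.0'."""
    return int(version_string.split(".")[0])


def uses_pandas_2(version_string: str) -> bool:
    """True when the given pandas version string is 2.x or later."""
    return major_version(version_string) >= 2


# Hypothetical call site: branch on the installed pandas version.
# import pandas as pd
# if uses_pandas_2(pd.__version__):
#     ...  # pandas 2 behaviour
# else:
#     ...  # pandas 1.x behaviour
```

A check like this would be evaluated once at import time, and the tests that compare column order could then pick the expected result per major version.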

pquentin commented 11 months ago

Ideally we should support both versions as Pandas 1.x is still generally more popular than 2.x. Thanks for all the pull requests that are moving us in the right direction. We'll have to decide when we hit more thorny issues.

davidkyle commented 9 months ago

Pandas requires a minimum NumPy version of 1.22.4. https://pandas.pydata.org/docs/dev/getting_started/install.html#dependencies

Because Shap is incompatible with NumPy >= 1.24 (#539), we will have to pin NumPy to the range numpy>=1.22.4,<1.24 when upgrading Pandas.
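As a rough illustration of which versions that range admits, the sketch below checks a version string against it using only the standard library. The helpers `parse_release` and `in_numpy_pin` are hypothetical names for this example, and the parsing is deliberately simplistic (it looks only at the leading digits of each dot-separated part), so it is not a substitute for a real version-specifier parser.

```python
from itertools import takewhile
from typing import Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Extract the numeric release parts, e.g. '1.23.5' -> (1, 23, 5)."""
    parts = []
    for piece in version_string.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def in_numpy_pin(version_string: str) -> bool:
    """True when the version falls in the range numpy>=1.22.4,<1.24."""
    return (1, 22, 4) <= parse_release(version_string) < (1, 24)
```

With this sketch, `in_numpy_pin("1.23.5")` is accepted, while `in_numpy_pin("1.22.3")` and `in_numpy_pin("1.24.0")` are rejected.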

pquentin commented 9 months ago

Looks like Shap is in a better shape now :) https://github.com/shap/shap/pull/2943. We could probably remove the numpy pin when CI is fixed. I opened https://github.com/elastic/eland/pull/636 for this.