apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Pandas 2.1 fails some categorical tests #28638

Open caneff opened 11 months ago

caneff commented 11 months ago

What happened?

When trying to support Pandas 2.1, running pytest frames_test.py::GroupByTest::test_groupby_level_agg_3 fails with:

E       AssertionError: Expression does not preserve partitioning!
E                       Expression: ComputedExpression[pre_combine_max_Series_139900651301776]
E                       Requires: Arbitrary
E                       Preserves: Arbitrary
E                       Input partitioning: Index
E                       Expected output partitioning: Index

Also, test_groupby_level_agg_6 fails the same way.

Note that if I change test_groupby_level_agg_3 in the parameterized setup from [0, 3] to [3, 0] (so the categorical is first instead of last), it no longer fails. However, if I change test_groupby_level_agg_6 from [1, 'str'] to ['str', 1], it still fails, so the level order alone doesn't give a clear-cut answer.
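For anyone who wants to poke at the level-order question outside of Beam, here is a rough pandas-only sketch. It is not the Beam test: the frame, the level names (num, cat) and the values are all made up for illustration. It just groups by the same two index levels in both orders and compares the results without relying on row order.

```python
import pandas as pd

# Hypothetical data: a MultiIndex with an integer level and a categorical level.
index = pd.MultiIndex.from_arrays(
    [
        [2, 1, 2, 1],
        pd.Categorical(["b", "a", "a", "b"], categories=["a", "b", "c"]),
    ],
    names=["num", "cat"],
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# Group with the categorical level last, then with it first.
# observed=True keeps only the category combinations present in the data.
cat_last = df.groupby(level=["num", "cat"], observed=True).max()
cat_first = df.groupby(level=["cat", "num"], observed=True).max()

# Compare as dictionaries so that row order (and hence sorting) is irrelevant.
by_num_cat = cat_last["value"].to_dict()
by_cat_num = {(num, cat): v for (cat, num), v in cat_first["value"].to_dict().items()}

print(by_num_cat == by_cat_num)
```

If the two dictionaries agree but the ordered frames do not, that would point at ordering of the result index rather than at the aggregation itself.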

I suspect this and #28637 are intertwined.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)


caneff commented 11 months ago

R: @tvalentyn

tvalentyn commented 11 months ago

cc: @robertwb who might have ideas.

caneff commented 11 months ago

I chased this down and found that the underlying pandas issue is that df.sort_index() doesn't actually sort the index under some circumstances that aren't clear to me yet. Issue here: https://github.com/pandas-dev/pandas/issues/55379
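A quick way to check whether sort_index() actually sorted a given frame is to test the resulting index for monotonicity. The sketch below is only a checking helper, not a reproduction of the linked pandas issue: the helper name sort_index_is_effective and the sample frames are hypothetical, and the triggering conditions for the bug are not known here.

```python
import pandas as pd


def sort_index_is_effective(df: pd.DataFrame) -> bool:
    """Return True if df.sort_index() yields a monotonically increasing index."""
    return bool(df.sort_index().index.is_monotonic_increasing)


# Plain integer index, as a baseline.
simple = pd.DataFrame({"value": [10, 20, 30]}, index=[3, 1, 2])
simple_ok = sort_index_is_effective(simple)

# Made-up MultiIndex with the categorical level last, loosely mirroring the
# level layout discussed in this issue.
levels = pd.MultiIndex.from_arrays(
    [
        [2, 1, 2, 1],
        pd.Categorical(["b", "a", "a", "b"], categories=["a", "b", "c"]),
    ],
    names=["num", "cat"],
)
multi = pd.DataFrame({"value": [1, 2, 3, 4]}, index=levels)
multi_ok = sort_index_is_effective(multi)

print("simple:", simple_ok, "multi:", multi_ok)
```

Running the helper on the actual frame from the failing test, under both pandas 2.0 and 2.1, would show whether the sort is the step that changed.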