dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.55k stars 712 forks source link

Fix too strict assertion in shuffle code for pandas subclasses #8667

Closed jorisvandenbossche closed 3 weeks ago

jorisvandenbossche commented 3 weeks ago

I am testing the p2p shuffle in dask-geopandas (https://github.com/geopandas/dask-geopandas/pull/295), and this fix was needed to get it running.
Pandas does not require that the _constructor_sliced attribute for subclasses is a class (type) itself, it should just be a callable to construct the subclassed Series object (that is also how it is used here: it's only used on the line below as a callable worker_for = constructor(worker_for)). So this assertion is wrong (and not necessary anyway IMO, if this attribute would do something wrong, that's a problem with the subclass in general)

I haven't added a test for now (it would require to add a dummy pandas / dask collection subclass boilerplate (given you don't want to depend on an external package like dask-geopandas I think), but that seems a bit much for this simple change).

github-actions[bot] commented 3 weeks ago

Unit Test Results

_See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests._

    29 files  ±    0      29 suites  ±0   11h 20m 2s :stopwatch: + 1h 29m 32s  4 054 tests  -     5   3 954 :white_check_mark: +   10     97 :zzz:  -   9  3 :x:  - 2  55 841 runs  +7 583  53 677 :white_check_mark: +7 355  2 160 :zzz: +247  4 :x:  - 2 

For more details on these failures, see this check.

Results for commit 5032d663. ± Comparison against base commit cbc21dff.

This pull request removes 13 and adds 8 tests. Note that renamed tests count towards both. ``` distributed.protocol.tests.test_arrow distributed.protocol.tests.test_collection distributed.protocol.tests.test_highlevelgraph distributed.protocol.tests.test_numpy distributed.protocol.tests.test_pandas distributed.shuffle.tests.test_graph distributed.shuffle.tests.test_merge distributed.shuffle.tests.test_merge_column_and_index distributed.shuffle.tests.test_metrics distributed.shuffle.tests.test_rechunk … ``` ``` distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_scheduler_report_args[report_args0] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[1] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers[True] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[False] distributed.diagnostics.tests.test_memray ‑ test_basic_integration_workers_report_args[report_args0] ```

:recycle: This comment has been updated with latest results.

jorisvandenbossche commented 3 weeks ago

Ah, the assert was there just to satisfy the type checker ..

I added a type: ignore, but could also add a cast(constructor, Callable) or something like that if that is preferred