TomAugspurger opened 2 months ago
This is a shuffle issue (and it is also present in the current implementation, if I am not mistaken). `df.shuffle("a")` will lose your type, and that is what we do under the hood when `split_out != 1`. `shuffle_method="tasks"` keeps it; the `disk` and `p2p` methods lose it.

I can patch that so that your resulting DataFrame has the correct type, but I don't know whether we can guarantee keeping whatever else you might add to the subclass through shuffles without you overriding the shuffle-specific methods.
Describe the issue:
As part of https://github.com/geopandas/dask-geopandas/pull/285, we found that dask-expr will lose the type of a pandas DataFrame subclass in `groupby.agg` if (and only if?) the `split_out` parameter is used.

Minimal Complete Verifiable Example:
Given this file:
running that produces
I would expect the type there to be `__main__.MyDataFrame` regardless of `split_out`.

Anything else we need to know?:
Environment:
Edit: I made one addition to the script: registering a `@meta_nonempty.register(MyDataFrame)` handler. I noticed that in `DecomposableGroupbyAggregation.combine` and `DecomposableGroupbyAggregation.aggregate` the types were regular pandas DataFrames instead of the subclass. Registering that `meta_nonempty` does keep it as `MyDataFrame` initially. I put some print statements in those methods to print the type of `inputs[0]` and `type(_concat(inputs))`. So initially we're OK, but by the time we do the final `aggregate` we've lost the subclass.