Open fjetter opened 7 months ago
https://github.com/pandas-dev/pandas/pull/57454
now we just need a release
This has been backported to 2.2.1
FWIW I'm seeing to a much lesser extend take_2d_axis1_object_object
calls being invoked and blocking the GIL despite the dataframe not having any object columns. From what I can tell the columns this is referring to are uint64
.
It looks like higher bit uints are not handled in https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L15-L35 and the 8bit uint seems to have a duplicate entry here https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L16-L17
I suspect uint is falling back to obj?
Do you have a query where this happens?
I'll take a look later today
This is query 2. This is the partition_index column, see also https://github.com/dask/distributed/issues/8530
the partition index column is currently a unit64 so casting this to a smaller value is appropriate. If pandas is not implementing a fastpath for uint we can also use a signed integer for now
I don't actually know the history behind not adding uint here (maybe wheel size, but not sure), so using a singed integer should be the fastest workaround for now
The uint thing is on purpose btw, we treat bool as uint8, so that's probably the reason that this was added
I noticed that tpch query 1 is spending only about half it's time in parquet IO when using the dataset that's been produced by pyarrow
However, the run is heavily GIL congested and another GIL+native profile reveals that actually very few things (in python) are holding on to the GIL (native-only threads, e.g. of pyarrow are not tracked by py-spy so we don't see how arrow holds the gil to produce the dataframe)
which points to the take pandas function. There's already been a recent fix to this code area (see https://github.com/pandas-dev/pandas/pull/54483) for axis0