dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

P2P shuffling failed during transfer when data includes mixed type #8310

Open fjetter opened 1 year ago

fjetter commented 1 year ago

The Arrow backend used by P2P shuffling cannot handle object columns that contain mixed data types.

The internal error that is raised in this example is

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column mixed_stuff with type object')

while the user only receives a generic "shuffle failed" exception

RuntimeError: P2P shuffling [id] failed during transfer phase

Reproducing code example

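import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client

with Client() as client:
    # Object column mixing dicts and NumPy arrays, which Arrow cannot convert
    df = pd.DataFrame({
        "mixed_stuff": [{"foo": "bar"}, np.array((3,))] * 2,
        "int": [1, 2] * 2,
    })
    ddf = dd.from_pandas(df, npartitions=2)
    # Fails with "RuntimeError: P2P shuffling [id] failed during transfer phase"
    ddf.shuffle(on="int").compute()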
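
Until the shuffle reports the underlying error, one way to find offending columns up front is to scan the pandas frame for object columns that hold more than one Python type. This is a minimal sketch, not part of dask or distributed. The helper name `mixed_type_object_columns` is hypothetical, and it only inspects type names, so a column of strings with `NaN` placeholders would also be flagged.

```python
import numpy as np
import pandas as pd


def mixed_type_object_columns(frame):
    """Map each object column holding more than one Python type to the type names."""
    mixed = {}
    for name in frame.columns:
        column = frame[name]
        if column.dtype != object:
            continue  # only object columns can hold arbitrary Python values
        # Collect the distinct type names, ignoring None placeholders.
        type_names = sorted(
            {type(value).__name__ for value in column if value is not None}
        )
        if len(type_names) > 1:
            mixed[name] = type_names
    return mixed


df = pd.DataFrame(
    {
        "mixed_stuff": [{"foo": "bar"}, np.array((3,))] * 2,
        "int": [1, 2] * 2,
    }
)
print(mixed_type_object_columns(df))
```

Any column reported here is a candidate for normalizing to a single type, or for dropping, before calling `shuffle`.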

Cognitus-Stuti commented 3 months ago

Were you able to find a solution to this?