dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

LabelEncoder doesn't handle missing values in *dask* series of strings #954

Open phobson opened 1 year ago

phobson commented 1 year ago

Describe the issue:

When fitting a LabelEncoder on a dask series of strings with missing values (as np.nan), a TypeError is raised because "<" is not supported between str and float instances.

scikit-learn's LabelEncoder handles this case for both pandas and dask series, and dask-ml's LabelEncoder handles it for a pandas series. Only dask-ml's encoder applied to a dask series fails.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder as dask_le
from sklearn.preprocessing import LabelEncoder as skl_le
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaabbbcccdddeeefffgggg")
})

df.loc[[0, 2, 5, 10, 21], "A"] = np.nan

ddf = dd.from_pandas(df, npartitions=3)

# works
lenc = skl_le().fit(df["A"])
lenc = skl_le().fit(ddf["A"])
lenc = dask_le().fit(df["A"])

# fails
lenc = dask_le().fit(ddf["A"])

# but also works
lenc = dask_le().fit(ddf["A"].fillna(""))
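One caveat with the fillna workaround is that the fill value becomes a class of its own. The sketch below illustrates this with plain pandas and NumPy rather than dask-ml, so it only approximates what an encoder would do. The sentinel string "__missing__" is a hypothetical placeholder, not anything defined by dask-ml or scikit-learn.

```python
import numpy as np
import pandas as pd

# A small string series with missing values, similar in spirit to the example above
s = pd.Series(["a", np.nan, "b", "a", np.nan, "c"], dtype=object)

# Hypothetical sentinel; any string that cannot collide with a real label would do
SENTINEL = "__missing__"
filled = s.fillna(SENTINEL)

# With only strings left, sorting-based uniqueness works,
# but the sentinel now shows up as an ordinary class
classes, codes = np.unique(filled.to_numpy(dtype=object), return_inverse=True)

print(classes)
print(codes)
```

If the missing entries need to stay distinguishable downstream, the code assigned to the sentinel has to be tracked separately.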

Full Traceback:

➜ python label_encoder_repro.py
Traceback (most recent call last):
  File "/Users/paul/work/sources/dask-engineering/example-pipelines/criteo-HPO/label_encoder_repro.py", line 21, in <module>
    lenc = dask_le().fit(ddf["A"])
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 119, in fit
    self.classes_ = classes_.compute()
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/utils.py", line 71, in apply
    return func(*args, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/routines.py", line 1626, in _unique_internal
    u = np.unique(ar)
  File "<__array_function__ internals>", line 180, in unique
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 274, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts, 
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 336, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'str' and 'float'
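The last frames of the traceback suggest the failure comes from sorting an object array that mixes str values with float NaN. A minimal sketch of that step in isolation, using only NumPy (the helper name sort_raises_type_error is made up for illustration):

```python
import numpy as np

def sort_raises_type_error(arr):
    """Return True if an in-place sort of a copy of arr raises TypeError."""
    try:
        arr.copy().sort()
    except TypeError:
        return True
    return False

# Object array mixing strings with float NaN, like a string column with missing values
mixed = np.array(["a", np.nan, "b", "a"], dtype=object)

# The same labels with the missing value replaced by an empty string
strings_only = np.array(["a", "", "b", "a"], dtype=object)

mixed_fails = sort_raises_type_error(mixed)
strings_fail = sort_raises_type_error(strings_only)

print(mixed_fails, strings_fail)
```

This is consistent with fillna("") avoiding the error: once every element is a string, the comparisons made during sorting are all well defined.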

Environment:

DuanBoomer commented 1 year ago

@phobson Hello, may I work on this issue ("LabelEncoder doesn't handle missing values in dask series of strings", #954)?

phobson commented 1 year ago

@DuanBoomer I'd be happy to review a PR. Thanks for volunteering. Note that I'll be largely away from my computer this week through the New Year. So if my response time is slow, I haven't forgotten about you.

DuanBoomer commented 1 year ago

@phobson The PR will be submitted by Sunday if that's okay with you. Today is Monday.