dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Categorizer should sort categories #916

Open miltava opened 2 years ago

miltava commented 2 years ago

Hello everyone,

We are facing a problem when calling dd.get_dummies (or DummyEncoder) after using Categorizer to infer the categories.

The problem seems to arise when two columns have the same categorical values that appear in a different order. In that case, we get: ValueError: The columns in the computed data do not match the columns in the provided metadata. Order of columns does not match.

The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order) we will get the ValueError.

We would expect get_dummies to work in both cases.

Thanks for the great work.

Milton

import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import Categorizer
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    },
)

# setting categories explicitly in the same order works
ddf = dd.from_pandas(pdf, npartitions=1)
cat = Categorizer(
    categories={
        "c1": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c2": CategoricalDtype(categories=["a", "c"], ordered=False),
        "c3": CategoricalDtype(categories=["d"], ordered=False),
    }
)
ddf = cat.fit_transform(ddf)
print(dd.get_dummies(ddf).compute())

# if categorizer infers the categories in a different
# order we'll get an exception on get_dummies
ddf = dd.from_pandas(pdf, npartitions=1)

cat = Categorizer()
ddf = cat.fit_transform(ddf)

print(ddf.compute())
# this will show that categories are inferred as 
# ['a', 'c'] and ['c', 'a'] for c1 and c2 respectively
# I believe this is causing the problem
print(cat.categories_)
print(dd.get_dummies(ddf).compute())

Environment:

miltava commented 2 years ago

Another curious thing is that get_dummies works if c3 isn't present in the dataframe.

jsignell commented 2 years ago

Thanks for writing this up @miltava! I am going to see if I can reproduce without dask-ml.

jsignell commented 2 years ago

Well, I had trouble reproducing without dask-ml, but I did reproduce it with dask-ml, and I think you are right to point to the difference in category order as the source of the issue. In particular, it looks like the columns are in a different order in the _meta (the tiny version of the df that we use to know what the dataframe looks like) than they are in the computed dataframe. Normally you can get around an issue like that by including enforce_metadata=False, but that's not quite the case for get_dummies, since it has a special way of calculating meta. I am opening a PR that will make it less special. After that PR gets in, you'll be able to do dd.get_dummies(ddf, enforce_metadata=False).compute(). Do you think this is good enough? I think the other option would be to try to sort the output columns, which would make the resulting column order not necessarily match pandas.
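To make the role of category order concrete, here is a minimal pandas-only sketch. It does not reproduce the dask error; it only shows that the dummy column order follows the category order of each column, which is why two dataframes with the same values but differently ordered categories end up with differently ordered columns. The variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

pdf = pd.DataFrame({"c1": ["a", "c"], "c2": ["c", "a"]})

# Categories as reported for the inferred case: c2 keeps order of appearance.
inferred = pdf.astype(
    {
        "c1": CategoricalDtype(["a", "c"]),
        "c2": CategoricalDtype(["c", "a"]),
    }
)

# Same values, but both columns share one sorted category order.
consistent = pdf.astype(
    {
        "c1": CategoricalDtype(["a", "c"]),
        "c2": CategoricalDtype(["a", "c"]),
    }
)

# The dummy columns for each source column follow its category order.
inferred_columns = list(pd.get_dummies(inferred).columns)
consistent_columns = list(pd.get_dummies(consistent).columns)

print(inferred_columns)
print(consistent_columns)
```

The two printed lists differ only in the order of the c2 columns.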

makquel commented 2 years ago

@jsignell that sounds fair enough, but I think that one of the reasons to use a Categorizer -> DummyEncoder pipeline is to be able to fit a partition that hasn't been used to train the encoder estimator before. Do you agree? The point I'm trying to make in pseudo-code would look like this:

# right before splitting the original dataset into train and test partitions
ce = Categorizer()
train_df = ce.fit_transform(train_df)
de = DummyEncoder()
train_df = de.fit_transform(train_df)

# later on
test_df = de.fit(test_df)

To that extent, in most ML cases, using the get_dummies method on its own wouldn't be good enough.

jsignell commented 2 years ago

I suspect you know a lot more than me about what the reasons are for using these things :smile: I don't have a good sense of what the issue is. Is there an exception that gets raised or something when you call de.fit later on?

miltava commented 2 years ago

Hi Julia,

Without knowing the specifics of the implementation, it seems to me that the best option would be to have both the _meta and the computed dataframe in the same order. But I have no idea how hard that would be.

I think the enforce_metadata could be a good compromise for those that are using dd.get_dummies directly.

And for those using Categorizer, I think a good compromise would be to always sort categories in the Categorizer (if they are not ordered).

Does that make sense?

Thanks

jsignell commented 2 years ago

Right, so the _meta cannot know the order of the computed columns in this case, because the categories in the meta are ['a', 'b'], the same as they are in the first partition. The _meta needs to be able to work without knowing things about every single partition.

So I agree that the simplest solution here seems to be to sort the categories in Categorizer. With that in mind I am going to transfer this issue to the dask-ml github repo. You should feel welcome to open a pull request to make that change if you like!
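Until such a change lands, one possible workaround is to build the sorted categories yourself and pass them in explicitly, as in the working half of the original example. The sketch below uses only pandas; sorted_dtypes is a hypothetical helper, not part of dask-ml, and it assumes the distinct values fit in memory and are mutually comparable.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def sorted_dtypes(df, columns):
    """Return {column: unordered CategoricalDtype with sorted categories}.

    Hypothetical helper for illustration; not part of dask-ml.
    """
    return {
        col: CategoricalDtype(
            categories=sorted(df[col].dropna().unique()),
            ordered=False,
        )
        for col in columns
    }


pdf = pd.DataFrame(
    {
        "c1": ["a", "c"],
        "c2": ["c", "a"],
        "c3": ["d", "d"],
    }
)

dtypes = sorted_dtypes(pdf, ["c1", "c2", "c3"])
print(dtypes)
```

The resulting dict has the same shape as the categories argument used in the original report, so it could be passed as Categorizer(categories=dtypes) in place of the hand-written mapping.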

SultanOrazbayev commented 2 years ago

This might be relevant (order of metadata affecting downstream operations): https://github.com/dask/dask/issues/9080

makquel commented 2 years ago

@jsignell how can I contribute to the dask-ml repo? Do I need to open a PR first?

jsignell commented 2 years ago

Sorry, I missed this ping. Sure, just open a PR. There are some detailed instructions at https://docs.github.com/en/get-started/quickstart/contributing-to-projects if you are unsure about how exactly to do that. There are also some development guidelines at https://docs.dask.org/en/stable/develop.html that are about dask more broadly but likely apply.