Open miltava opened 2 years ago
Another curious thing is that get_dummies works if c3 isn't present in the dataframe.
Thanks for writing this up @miltava! I am going to see if I can reproduce without dask-ml.
Well, I had trouble reproducing without dask-ml, but I did reproduce it with dask-ml, and I think you are right in pointing to the difference in category order as the source of the issue. In particular, it looks like the columns are just in a different order in the _meta (the tiny version of the df that we use to know what the dataframe looks like) than they are in the computed dataframe. Normally you can get around an issue like that by including enforce_metadata=False, but that's not quite the case for get_dummies, since it has a special way of calculating meta. I am opening a PR that will make it less special. After that PR gets in you'll be able to do dd.get_dummies(ddf, enforce_metadata=False).compute(). Do you think this is good enough? I think the other option would be to try to sort the output columns, which would make the resulting column order not necessarily match pandas.
@jsignell that sounds fair enough, but I think that one of the reasons to use a Categorizer -> DummyEncoder pipeline is to be able to fit a partition that hasn't been used to train the encoder estimator before. Do you agree? The point I'm trying to make in pseudo-code would look like this:
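from dask_ml.preprocessing import Categorizer, DummyEncoder

# right after splitting the original dataset into train and test partitions
ce = Categorizer()
train_df = ce.fit_transform(train_df)  # learn the categories from the training partition
de = DummyEncoder()
train_df = de.fit_transform(train_df)  # learn the dummy columns from the training partition
# later on
test_df = de.fit(test_df)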
And by extension, in most ML cases, using the get_dummies method wouldn't be good enough.
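To make the point concrete, here is a minimal sketch in plain pandas (not dask-ml, and not how Categorizer or DummyEncoder are implemented). It assumes the categories are learned once from the training partition and then reused on a partition the encoder has never seen; the frames and the column name c1 are made up for illustration.

```python
import pandas as pd

# Hypothetical train/test partitions; "a" never appears in the test partition.
train = pd.DataFrame({"c1": ["a", "b", "a"]})
test = pd.DataFrame({"c1": ["b", "b"]})

# "Fit": learn the categories from the training partition only, in sorted order.
learned_dtype = pd.CategoricalDtype(categories=sorted(train["c1"].unique()))

# "Transform": apply the learned categories to both partitions before encoding.
train_enc = pd.get_dummies(train.astype({"c1": learned_dtype}))
test_enc = pd.get_dummies(test.astype({"c1": learned_dtype}))

# Both partitions end up with the same dummy columns in the same order,
# even though the test partition contains no "a" rows.
print(list(train_enc.columns))
print(list(test_enc.columns))
```

Calling get_dummies on the raw test partition instead would yield only a c1_b column, which is why a fitted encoder is needed for unseen partitions.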
I suspect you know a lot more than me about what the reasons are for using these things :smile: I don't have a good sense of what the issue is. Is there an exception that gets raised or something when you call de.fit later on?
Hi Julia,
Without knowing the specifics of the implementation, it seems to me that the best option would be to have the columns in the same order in both the _meta and the computed dataframe. But I have no idea how hard that would be.
I think the enforce_metadata could be a good compromise for those that are using dd.get_dummies directly.
And for those using Categorizer, I think a good compromise would be to always sort categories in the Categorizer (if they are not ordered).
Does that make sense?
Thanks
Right, so the _meta cannot know the order of the computed columns in this case, because the categories in the meta are ['a', 'b'], same as they are in the first partition. The _meta needs to be able to work without knowing things about every single partition.
So I agree that the simplest solution here seems to be to sort the categories in Categorizer. With that in mind I am going to transfer this issue to the dask-ml github repo. You should feel welcome to open a pull request to make that change if you like!
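As a rough illustration of why sorting helps, the sketch below uses plain pandas rather than dask or dask-ml. It relies only on the fact that pandas get_dummies emits one column per category, in category order; the helper dummies_columns and the column name c1 are hypothetical names for this example.

```python
import pandas as pd


def dummies_columns(values, categories):
    """Return the dummy column names produced for the given category order."""
    col = pd.Series(pd.Categorical(values, categories=categories), name="c1")
    return list(pd.get_dummies(col.to_frame()).columns)


values = ["a", "b"]

# Category order as the meta sees it.
meta_cols = dummies_columns(values, ["a", "b"])

# Same values, but the categories were inferred in a different order.
computed_cols = dummies_columns(values, ["b", "a"])

# Sorting the inferred categories before encoding removes the mismatch.
sorted_cols = dummies_columns(values, sorted(["b", "a"]))

print(meta_cols, computed_cols, sorted_cols)
```

The first two lists contain the same names in a different order, which is the kind of mismatch the ValueError reports; the third matches the meta again.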
This might be relevant (order of metadata affecting downstream operations): https://github.com/dask/dask/issues/9080
@jsignell how can I contribute to the dask-ml repo? Do I need to open a PR first?
Sorry I missed this ping. Sure just open a PR. There are some detailed instructions in https://docs.github.com/en/get-started/quickstart/contributing-to-projects if you are unsure about how exactly to do that. There are also some development guidelines at https://docs.dask.org/en/stable/develop.html that are about dask more broadly but likely apply.
Hello everyone,
We are facing a problem when calling dd.get_dummies (or DummyEncoder) after using Categorizer to infer the categories.
The problem seems to arise when two columns have the same categorical values that appear in a different order. In that case, we will get a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match
The example below shows that get_dummies works fine when we explicitly define the categories in the same order. But when Categorizer infers the categories (in a different order) we will get the ValueError.
We would expect get_dummies to work in both cases.
Thanks for the great work.
Milton
Environment: