JuliaML / TableTransforms.jl

Transforms and pipelines with tabular data in Julia
https://juliaml.github.io/TableTransforms.jl/stable
MIT License
103 stars 15 forks source link

OneHot should probably produce categorical arrays #158

Closed juliohm closed 1 year ago

juliohm commented 1 year ago

Currently the resulting columns are just Vector{Bool}, maybe they should be categorical arrays.

juliohm commented 1 year ago

@vickydeka would like to give this issue a try? It should be as easy as adding a call to categorical in the right places. It is a good opportunity to keep practicing contributions to open source on GitHub :)

rajatrc1705 commented 1 year ago

Hey, can I work on this issue if no one has started work on this? I want to get started with contributing to Julia, if you could assign me this, I can get started on getting familiar with the related aspects of this issue and submit a PR.

ablaom commented 1 year ago

I'm not sure this is a good idea. Ordinarily one one-hot-encodes so that models designed for "continuous" (abstract float) data can be applied when the data is categorical. To encode as categorical is a step backwards, no?

rajatrc1705 commented 1 year ago

I think the OG issue is recommending to convert the Bool type of the columns generated after OneHot process to CategoricalArray type

image

Current Behavior:

image

New Behavior:

image

although I am not aware of the reason behind doing this.

juliohm commented 1 year ago

I'm not 100% sure of what would be the most appropriate behavior. I know that one-hot encoding is often used as "continuous" 0s and 1s but I also have seen cases where it is treated as a binary variable. Maybe a keyword option would solve the problem.

Em sex., 16 de dez. de 2022 23:28, rajatrc1705 @.***> escreveu:

I think the OG issue is recommending to convert the Bool type of the columns generated after OneHot process to CategoricalArray type

[image: image] https://user-images.githubusercontent.com/52173002/208219270-a213f657-2f1c-4e77-a5f6-fc4644ecba5f.png Current Behavior:

[image: image] https://user-images.githubusercontent.com/52173002/208219341-82c63605-27b9-480e-ba35-c0f48cf8295b.png New Behavior:

[image: image] https://user-images.githubusercontent.com/52173002/208219356-fc8d803e-bf73-4455-9148-d873eb32073f.png

although I am not aware of the reason behind doing this.

โ€” Reply to this email directly, view it on GitHub https://github.com/JuliaML/TableTransforms.jl/issues/158#issuecomment-1355969553, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3IWJGPZNXCP4IOEKI3WNUQNXANCNFSM6AAAAAASVMRCDI . You are receiving this because you authored the thread.Message ID: @.***>

eliascarv commented 1 year ago

Like this:

OneHot(:x, categorical=true)

By default, the keyword categorical can be equal to false.

juliohm commented 1 year ago

Exactly @eliascarv. I would set the default to true though as the continuous interpretation is more low-level.

@rajatrc1705 can you work on adding this option? That would be a nice contribution.

rajatrc1705 commented 1 year ago

I will start working on it! Could anyone confirm whether the images I posted for the "New Behavior" is what we want as the output?

juliohm commented 1 year ago

Confirmed ๐Ÿ‘๐Ÿป

Em sรกb., 17 de dez. de 2022 14:50, rajatrc1705 @.***> escreveu:

I will start working on it! Could anyone confirm whether the images I posted for the "New Behavior" is what we want as the output?

โ€” Reply to this email directly, view it on GitHub https://github.com/JuliaML/TableTransforms.jl/issues/158#issuecomment-1356353442, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3LGNPNBMU2UTVXV3A3WNX4OVANCNFSM6AAAAAASVMRCDI . You are receiving this because you authored the thread.Message ID: @.***>

ablaom commented 1 year ago

Sorry for chiming in late here, but I would say that generally it's preferable for the type (or scitype) of the output of a transformer be as insensitive to hyperparameters as possible. That way, transformers can be chained and remain valid when just one component hyperparameter is changed. So, in this case, a preferred solution would have been to add a transformer that changes Bool columns to Categorical columns for chaining with the encoder as previously implemented.

juliohm commented 1 year ago

The issue here is a bit different though. The design of OneHot had the wrong scitype by default and we fixed that. In order to preserve the continuous 0s and 1s use case we added a kwarg option.

Em seg., 9 de jan. de 2023 01:21, Anthony Blaom, PhD < @.***> escreveu:

Sorry for chiming in late here, but I would say that generally it's preferable for the type (or scitype) of the output of a transformer be as insensitive to hyperparameters as possible. That way, transformers can be chained and remain valid when just one component hyperparameter is changed. So, in this case, a preferred solution would have been to add a transformer that changes Bool columns to Categorical columns for chaining with the encoder as previously implemented.

โ€” Reply to this email directly, view it on GitHub https://github.com/JuliaML/TableTransforms.jl/issues/158#issuecomment-1375031047, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3KOMH2CEROGHDWMX6LWRNY4JANCNFSM6AAAAAASVMRCDI . You are receiving this because you modified the open/close state.Message ID: @.***>