HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

FeatureOperation: OneHotEncoder #52

Closed alessiamarcolini closed 3 years ago

alessiamarcolini commented 4 years ago

[3c ii]

leriomaggio commented 3 years ago

Can I kindly ask what the whole idea is here? I know we are supposed to be talking about this during our next call... so a bit of preparation would be ideal.

Following through, I could only see fixes (e.g. see #98) but I am not sure I got to what, exactly :D

Thanks.

alessiamarcolini commented 3 years ago

@leriomaggio This is a "feature request" issue: we want to have a FeatureOperation that performs one-hot encoding of a column.

The tricky point here is the handling of NaNs in the column to encode. sklearn doesn't handle them: it simply raises an exception. We wanted to allow NaNs in the column, so we proceeded this way:

  1. we find where the NaNs are in the column to encode and we store a "NaN mask" (as a boolean array)
  2. we replace NaNs with a specific string ("NAN_VALUE")
  3. we perform the encoding via sklearn
  4. we remove from the encoded columns the one corresponding to the NAN_VALUE category
  5. we replace the False values with NaN in the rows of the encoded columns where the original NaNs were (via the NaN mask)
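
To make the five steps above concrete, here is a minimal sketch of how they could fit together. It is not the actual PyTrousse implementation: the helper name `one_hot_encode_with_nans`, the `<column>_<category>` naming of the output columns, and the choice to write NaN back into the masked rows are all assumptions made for illustration. It also assumes the column holds string categories.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

NAN_VALUE = "NAN_VALUE"  # sentinel string used in place of NaNs


def one_hot_encode_with_nans(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode `column`, keeping NaN rows as NaN."""
    # 1. Boolean "NaN mask": True where the original column is NaN
    nan_mask = df[column].isna().to_numpy()

    # 2. Replace NaNs with the sentinel so the encoder sees a regular category
    filled = df[[column]].astype(object).fillna(NAN_VALUE)

    # 3. Encode via sklearn (default sparse output, densified for simplicity)
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(filled).toarray()
    categories = list(encoder.categories_[0])
    encoded_df = pd.DataFrame(
        encoded,
        columns=[f"{column}_{cat}" for cat in categories],  # assumed naming
        index=df.index,
    )

    # 4. Drop the encoded column that corresponds to the sentinel category
    encoded_df = encoded_df.drop(columns=f"{column}_{NAN_VALUE}", errors="ignore")

    # 5. Use the NaN mask to overwrite the rows that were NaN originally
    #    (assumption: the replacement value is NaN)
    encoded_df.loc[nan_mask, :] = np.nan
    return encoded_df


df = pd.DataFrame({"color": ["red", np.nan, "blue", "red"]})
encoded = one_hot_encode_with_nans(df, "color")
```

With this toy input, `encoded` has two columns, `color_blue` and `color_red`, and the second row is NaN in both, so a missing value stays distinguishable from "none of the categories".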

Please let me know if you need further clarification before the call.

P.S.:

> Following through, I could only see fixes (e.g. see #98) but I am not sure I got to what, exactly :D

I used "fixes" as an auto-closing keyword, so that the PR is automatically linked to the issue.