HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

Fix issue for rie model computation #103

Closed lorenz-gorini closed 3 years ago

lorenz-gorini commented 3 years ago

Removed column conversion to category. During analysis, pytrousse should not perform the conversion to category (this would make the column a pd.Categorical instance instead of a pd.Series, and these are not numpy arrays).

Replaced hardcoded values with arguments for breed_specific_bin_splitting

alessiamarcolini commented 3 years ago

> During analysis, pytrousse should not perform the conversion to category (this would make the column a pd.Categorical instance instead of a pd.Series, and these are not numpy arrays).

I am not sure that this should be a concern of pytrousse. Shouldn't it be handled by whoever uses the data afterwards?

lorenz-gorini commented 3 years ago

> > During analysis, pytrousse should not perform the conversion to category (this would make the column a pd.Categorical instance instead of a pd.Series, and these are not numpy arrays).
>
> I am not sure that this should be a concern of pytrousse. Shouldn't it be handled by whoever uses the data afterwards?

I am not sure. Since the conversion makes categorical data behave differently from the other data structures, it could be better to perform this conversion through a FeatureOperation, so that the user is fully aware of it.
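A minimal sketch of that idea, assuming a hypothetical `ToCategory` operation and a hypothetical `apply_and_record` helper (neither is part of pytrousse's actual API; they only illustrate making the conversion explicit and recorded):

```python
import pandas as pd


class ToCategory:
    """Hypothetical operation, NOT the pytrousse FeatureOperation API."""

    def __init__(self, column):
        self.column = column

    def __call__(self, df):
        # Work on a copy so the input DataFrame is left untouched.
        result = df.copy()
        result[self.column] = result[self.column].astype("category")
        return result


def apply_and_record(df, operation, history):
    """Hypothetical helper: apply an operation and keep track of it."""
    history.append(operation)
    return operation(df)


history = []
df = pd.DataFrame({"breed": ["beagle", "collie", "beagle"]})  # made-up data
converted = apply_and_record(df, ToCategory("breed"), history)
```

With this shape the conversion only happens when the user asks for it, and `history` shows that it was applied.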

leriomaggio commented 3 years ago

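In my personal opinion this points out two levels of issues that need to be taken care of:

* (A) Operations like `pd.Categorical` (or `astype("category")`) should be part of a `FeatureOperation`, otherwise they won't be recorded.

* (B) In terms of API and processing: you should not rely on whether the return type is `Categorical` or `Series`.
  This is simply bad practice.
  You should be using `to_numpy` instead to make sure that you're dealing with `numpy` arrays whenever you need or expect to.

my2c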

lorenz-gorini commented 3 years ago

> In my personal opinion this points out two levels of issues that need to be taken care of:
>
> * (A) Operations like `pd.Categorical` (or `astype("category")`) should be part of a `FeatureOperation`, otherwise they won't be recorded.
>
> * (B) In terms of API and processing: you should not rely on whether the return type is `Categorical` or `Series`.
>   This is simply bad practice.
>   You should be using `to_numpy` instead to make sure that you're dealing with `numpy` arrays whenever you need or expect to.
>
> my2c

Right. Thanks! In fact, during the Reference Interval model computation I was using .values to get a numpy array, but pd.Categorical does not have this attribute. Instead, to_numpy() works on both pd.Series and pd.Categorical.
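A quick check of this behaviour, as a minimal sketch with made-up sample data (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

# Made-up sample data, only for illustration.
series = pd.Series(["a", "b", "a"])
categorical = pd.Categorical(["a", "b", "a"])
category_series = series.astype("category")

# On a category-dtype Series, .values gives back a pd.Categorical,
# so code expecting a numpy array from .values can break.
values_result = category_series.values

# to_numpy() returns a numpy.ndarray for all three objects.
arrays = [obj.to_numpy() for obj in (series, categorical, category_series)]
```

So calling `to_numpy()` wherever a numpy array is expected avoids depending on whether the column was converted to category beforehand.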