HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

Some features may be misclassified as categorical #31

Open lorenz-gorini opened 4 years ago

lorenz-gorini commented 4 years ago

In the DataFrameWithInfo property "column_list_by_type", a feature is classified as categorical based on a threshold value (pd_extras.settings.CATEG_COL_THRESHOLD). This value indicates how many samples are expected to share the same repeated value if the feature is categorical.

For this reason, a feature may be misclassified as categorical or non-categorical when its number of unique values is close to the threshold, i.e. not_NaN samples count / CATEG_COL_THRESHOLD.
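To make the borderline case concrete, here is a minimal sketch of the rule as I understand it. It is not the actual pytrousse code: the value 10 for CATEG_COL_THRESHOLD, the helper names, and the direction of the comparison (categorical when the unique count is at or below the limit) are all assumptions, and plain Python lists stand in for DataFrame columns to keep the snippet self-contained.

```python
import math

# Hypothetical value for illustration; the real one is defined in
# pd_extras.settings.CATEG_COL_THRESHOLD.
CATEG_COL_THRESHOLD = 10


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unique_value_limit(values, categ_col_threshold=CATEG_COL_THRESHOLD):
    """Return not_NaN samples count / CATEG_COL_THRESHOLD for one column."""
    non_nan = [v for v in values if not _is_missing(v)]
    return len(non_nan) / categ_col_threshold


def looks_categorical(values, categ_col_threshold=CATEG_COL_THRESHOLD):
    """Assumed rule: categorical if the unique count does not exceed the limit."""
    non_nan = [v for v in values if not _is_missing(v)]
    return len(set(non_nan)) <= len(non_nan) / categ_col_threshold


# Two columns with 100 non-NaN samples each, differing by one unique value.
ten_levels = [i % 10 for i in range(100)]
eleven_levels = [i % 11 for i in range(100)]
```

Under these assumptions the limit is 100 / 10 = 10 unique values, so `ten_levels` is treated as categorical and `eleven_levels` is not, even though the two columns are nearly identical in nature.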

My proposal is to add a new attribute listing the features whose number of unique values is close to that threshold (a list of features that need clarification), so that the user can classify them. Moreover, the user should be able to define the type of specific features, and these user-defined types should never be overwritten by the DataFrameWithInfo classification methods.
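A rough sketch of how the proposal could behave, under stated assumptions: `classify_columns`, `tolerance`, `user_types` and the "needs clarification" list are hypothetical names, not part of the current DataFrameWithInfo API, and the 20% band around the limit is an arbitrary choice for illustration. Columns are plain Python lists here rather than DataFrame columns.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def classify_columns(columns, categ_col_threshold=10, tolerance=0.2, user_types=None):
    """Classify columns and list the ones too close to the threshold.

    columns: dict mapping column name to a list of values.
    user_types: dict mapping column name to a user-defined type; these
        entries are copied as-is and never overwritten or flagged.
    Returns (types, needs_clarification).
    """
    user_types = user_types or {}
    types = {}
    needs_clarification = []
    for name, values in columns.items():
        if name in user_types:
            # The user's choice always wins over the automatic rule.
            types[name] = user_types[name]
            continue
        non_nan = [v for v in values if not _is_missing(v)]
        limit = len(non_nan) / categ_col_threshold
        n_unique = len(set(non_nan))
        # Assumed direction of the comparison, as in the current rule.
        types[name] = "categorical" if n_unique <= limit else "non_categorical"
        # Flag columns whose unique count falls within the band around the limit.
        if limit > 0 and abs(n_unique - limit) <= tolerance * limit:
            needs_clarification.append(name)
    return types, needs_clarification


columns = {
    "clear_cat": [i % 3 for i in range(100)],
    "borderline": [i % 11 for i in range(100)],
    "clear_num": list(range(100)),
    "zip_code": [i % 11 for i in range(100)],
}
types, needs_clarification = classify_columns(
    columns, user_types={"zip_code": "categorical"}
)
```

Here "borderline" still gets an automatic type, but it also ends up in `needs_clarification` so the user can confirm or correct it, while "zip_code" keeps the user-defined type and is never flagged.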