HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

Distinguish `numerical_col` from `med_exam_col_list` #7

Open lorenz-gorini opened 4 years ago

lorenz-gorini commented 4 years ago

When the user accesses the `column_list_by_type` property of a `DataFrameWithInfo` instance, the columns of the DataFrame are analyzed and split into sets according to the type of their values, such as `numerical_cols` and `bool_cols`. These sets are partly reorganized and returned in a `ColumnListByType` instance, which holds sets of column names grouped by the type of the values they contain (e.g. `numerical_cols`, `bool_cols`, `med_exam_col_list`, ...).

The two `ColumnListByType` attributes `numerical_cols` and `med_exam_col_list` overlap: inside the `column_list_by_type` property, both are created as the union of the sets `numerical_cols` and `bool_cols`, i.e. the columns containing only numerical values and only boolean values, respectively.
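To make the overlap concrete, here is a minimal sketch of the behavior described above. It does not use pytrousse: `ColumnListByTypeSketch`, `current_split`, and the column names are hypothetical stand-ins, and the real `ColumnListByType` constructor may look different.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class ColumnListByTypeSketch:
    """Hypothetical stand-in for ColumnListByType (not the real pytrousse class)."""

    numerical_cols: Set[str]
    bool_cols: Set[str]
    med_exam_col_list: Set[str]


def current_split(numerical_cols: Set[str], bool_cols: Set[str]) -> ColumnListByTypeSketch:
    """Mimic the behavior reported in this issue (assumed, simplified)."""
    # Both attributes are built from the same union of numerical and boolean columns.
    union = set(numerical_cols) | set(bool_cols)
    return ColumnListByTypeSketch(
        numerical_cols=set(union),
        bool_cols=set(bool_cols),
        med_exam_col_list=set(union),
    )


# Made-up column names, for illustration only.
current = current_split({"age", "weight"}, {"is_smoker"})

# The boolean column ends up in numerical_cols as well.
print("is_smoker" in current.numerical_cols)
print(current.numerical_cols == current.med_exam_col_list)
```

In this sketch the two attributes cannot be told apart, which is the problem the issue title refers to.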

The suggestion is to modify the `column_list_by_type` property so that it returns a `ColumnListByType` object whose `numerical_cols` attribute does not include the `bool_cols` set. The `bool_cols` columns should remain both in the `bool_cols` attribute and in the `med_exam_col_list` attribute, which is meant to include all the columns whose values are suitable for numerical analysis (i.e. numerical or boolean values).
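A rough sketch of the suggested behavior, under the same caveats: `ProposedColumnLists` and `proposed_split` are hypothetical names used only for illustration, not the actual pytrousse API, and the column names are made up.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ProposedColumnLists:
    """Hypothetical stand-in for the proposed ColumnListByType contents."""

    numerical_cols: Set[str]
    bool_cols: Set[str]
    med_exam_col_list: Set[str]


def proposed_split(numerical_cols: Set[str], bool_cols: Set[str]) -> ProposedColumnLists:
    """Build the three sets as suggested in this issue (assumed, simplified)."""
    numerical = set(numerical_cols)
    boolean = set(bool_cols)
    return ProposedColumnLists(
        # Strictly numerical columns: boolean columns are excluded.
        numerical_cols=numerical - boolean,
        # Boolean columns keep their own attribute.
        bool_cols=boolean,
        # Everything suitable for numerical analysis: numerical or boolean.
        med_exam_col_list=numerical | boolean,
    )


# Made-up column names, for illustration only.
proposed = proposed_split({"age", "weight", "is_smoker"}, {"is_smoker"})

print(sorted(proposed.numerical_cols))
print(sorted(proposed.med_exam_col_list))
```

With this split, `numerical_cols` and `bool_cols` are disjoint, and `med_exam_col_list` is their union, so code that needs strictly numerical columns no longer has to filter out the boolean ones.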

Note: It is important to adapt this repo and the other repos (like smvet) accordingly, especially wherever the `numerical_cols` attribute is used.