HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

"read_file" and "get_df_from_csv" functions load boolean values as string ones #89

Open lorenz-gorini opened 3 years ago

lorenz-gorini commented 3 years ago

This issue is related and similar to issue #85. When the trousse.dataset.read_file and trousse.dataset.get_df_from_csv functions read a CSV file, they rely on the pandas.read_csv function to parse it.

By design, pandas tries to avoid columns with mixed-type values (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html), so when a column in the CSV file contains boolean values (i.e. True/False) along with values that cannot be parsed as booleans (e.g. the typo True% instead of True, or 0 instead of False), the column is loaded into a DataFrame (stored in the ._data attribute of Dataset) with dtype='object'.
The issue derives from how pandas handles such a column: whenever a column is loaded from a CSV file and its assigned dtype is object, all of its values are kept as strings. This means that if a CSV is similar to:

,col0,col1
0,1,True
1,1,False
2,0,True%
3,0,True
4,0,True

(where the boolean column col1 contains a typo, True%), the corresponding DataFrame has:

>>> import pandas as pd
>>> df = pd.read_csv(CSV_PATH)
>>> df['col1'].dtype
dtype('O')

And if we select the first element of column col1, its value will be:

>>> df['col1'][0]
'True'

and its type will be:

>>> type(df['col1'][0])
<class 'str'>

So even the pandas function infer_dtype does not recognize that column as a mixed column:

>>> pd.api.types.infer_dtype(df['col1'])
'string'

In conclusion, if a column of a CSV file contains at least one value that cannot be interpreted consistently with the others (e.g. a boolean value containing a typo), every value of that column will be loaded as a string.
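
The behavior can be reproduced without a CSV file on disk. The following is a minimal sketch that embeds the example data in the script and compares it with a copy where the typo is corrected; the names CSV_WITH_TYPO, CSV_CLEAN, df_typo and df_clean are only illustrative and do not come from trousse.

```python
import io

import pandas as pd

# Same data as the example CSV, embedded so that no file is needed.
CSV_WITH_TYPO = """,col0,col1
0,1,True
1,1,False
2,0,True%
3,0,True
4,0,True
"""
# Identical content, except that the typo "True%" is corrected to "True".
CSV_CLEAN = CSV_WITH_TYPO.replace("True%", "True")

df_typo = pd.read_csv(io.StringIO(CSV_WITH_TYPO))
df_clean = pd.read_csv(io.StringIO(CSV_CLEAN))

# Without the typo, pandas parses col1 as a real boolean column.
print(df_clean["col1"].dtype)

# With the typo, every value of col1 is a Python str, including the valid ones.
print({type(value).__name__ for value in df_typo["col1"]})
print(pd.api.types.infer_dtype(df_typo["col1"]))
```

A single malformed value is enough to change how the whole column is loaded, which is why the check has to happen after parsing rather than rely on read_csv alone.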

Similarly to issue #85, my proposal is to add a function inside the Dataset.__init__ method that analyzes columns with dtype='object'. For each such column, this function replaces the string 'True' with the boolean True and 'False' with False. This would change the type of each matching value from string to boolean, while leaving the other values (such as 'True%') untouched. As a result, when the Dataset method _columns_type calls the pd.api.types.infer_dtype function, the inferred type will not be 'string' but 'mixed' (so that the proper and expected inference will be performed).
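
As a rough illustration of the proposal, here is a minimal sketch of such a conversion step. The helper name convert_bool_strings and the BOOL_STRINGS mapping are hypothetical and not part of trousse; how it would be wired into Dataset.__init__ is left open. The example DataFrame is built directly with dtype=object to mimic what read_csv produces for the column with the typo.

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_object_dtype

# Hypothetical mapping and helper: names and signature are assumptions,
# not existing trousse code.
BOOL_STRINGS = {"True": True, "False": False}


def convert_bool_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where 'True'/'False' strings in object columns become booleans."""
    converted = df.copy()
    for column in converted.columns:
        # Only columns loaded with dtype=object are candidates.
        if not is_object_dtype(converted[column]):
            continue
        # Replace exact 'True'/'False' strings; leave every other value as it is.
        values = [
            BOOL_STRINGS.get(value, value) if isinstance(value, str) else value
            for value in converted[column]
        ]
        # Keep dtype=object so that booleans and leftover strings can coexist.
        converted[column] = pd.Series(values, index=converted.index, dtype=object)
    return converted


# Mimics the DataFrame that read_csv returns for the example CSV.
df = pd.DataFrame(
    {
        "col0": [1, 1, 0, 0, 0],
        "col1": pd.Series(["True", "False", "True%", "True", "True"], dtype=object),
    }
)
fixed = convert_bool_strings(df)

print(infer_dtype(df["col1"]))     # inferred type before the conversion
print(infer_dtype(fixed["col1"]))  # inferred type after the conversion
```

With this sketch the column should be reported as 'string' before the conversion and as 'mixed' after it, since the typo 'True%' remains a string next to real booleans. Only exact matches of 'True' and 'False' are converted here; whether variants such as 'true' or ' True' should also be handled is a separate design decision.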