HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0

"read_file" and "get_df_from_csv" functions load numerical values as string ones #85

Open lorenz-gorini opened 4 years ago

lorenz-gorini commented 4 years ago

When the trousse.dataset.read_file and trousse.dataset.get_df_from_csv functions are used to read a CSV file, they rely on the pandas.read_csv function to parse it.

By design, pandas tries to avoid columns with mixed-type values (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html), so when a column in the CSV file contains both numbers (int/float) and strings, the column is loaded into the DataFrame (inside the ._data attribute of Dataset) with dtype='object'.
The issue derives from the pandas behavior that, whenever a column is loaded from a CSV file and its assigned dtype is object, all of its values are loaded as strings, including the ones that look numeric. This means that if a CSV is similar to:

,col0,col1
0,1,True
1,1,False
2,0%,True
3,0,True
4,0,True

(where an integer column contains a typo like 0%), the corresponding DataFrame has:

>>> import pandas as pd
>>> df = pd.read_csv(CSV_PATH)
>>> df['col0'].dtype
dtype('O')

And if we select the first element of column col0, its value will be:

>>> df['col0'][0]
'1'

and its type will be:

>>> type(df['col0'][0])
<class 'str'>

So even the pandas function infer_dtype does not recognize that column as a mixed column:

>>> pd.api.types.infer_dtype(df['col0'])
'string'

In conclusion, if a column of a CSV file contains at least one value that cannot be interpreted consistently with the type of all the other values (e.g. a numerical value containing a typo), every value of that column will be interpreted as a string.
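For anyone who wants to reproduce this without a file on disk, here is a minimal self-contained sketch. It feeds the same CSV content to pandas.read_csv through io.StringIO; the CSV_TEXT name is only an illustration and is not part of trousse.

```python
import io

import pandas as pd
from pandas.api.types import infer_dtype

# Same content as the CSV sample in this issue; '0%' is the typo
# in an otherwise integer column.
CSV_TEXT = """,col0,col1
0,1,True
1,1,False
2,0%,True
3,0,True
4,0,True
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# The dtype assigned to the whole column (not a numeric one).
print(df["col0"].dtype)

# Collect the Python type of every value in the column.
value_types = {type(v).__name__ for v in df["col0"]}
print(value_types)

# infer_dtype only sees strings, so it cannot flag the column as mixed.
inferred = infer_dtype(df["col0"])
print(inferred)
```

Every value in col0, including the ones that look like integers, comes back as a str, which is why infer_dtype reports 'string'.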

My proposal is to add a function inside the Dataset.__init__ method that analyzes columns with dtype='object'. For each such column, this function should apply the pandas function pd.to_numeric(errors='ignore') (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) to each value, so that every value that can be interpreted as float/int is converted while the others are left untouched. The call has to be made value by value: applied to a whole column, pd.to_numeric(errors='ignore') returns the column unchanged as soon as one value cannot be parsed. This function should transform values like '2.1' and '2' into 2.1 and 2, changing the type of the elements. As a result, when the Dataset method _columns_type calls the pd.api.types.infer_dtype function, the inferred type will no longer be 'string' but a mixed one ('mixed' or 'mixed-integer'), so that proper and expected inference will be performed.