When the `trousse.dataset.read_file` and `trousse.dataset.get_df_from_csv` functions are used to read a CSV file, they rely on the `pandas.read_csv` function to parse it. By choice, pandas tries to avoid columns with mixed-type values (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html), so when a column of the CSV file contains both numbers (int/float) and strings, that column is loaded into the DataFrame (the `._data` attribute of `Dataset`) with `dtype='object'`.

The issue derives from the pandas behavior whereby, whenever a column loaded from a CSV file is assigned `dtype='object'`, all of its values are cast to `str`. This means that if an integer column of the CSV contains a typo like `0%`, every value of the corresponding DataFrame column, including the numeric ones, ends up as a string.
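A minimal reproduction of the situation follows; the original CSV snippet is not preserved here, so the exact values (a column `col0` holding the integers 1 and 2 plus the typo `0%`) are an assumption consistent with the outputs quoted below:

```python
>>> import io
>>> import pandas as pd
>>> df = pd.read_csv(io.StringIO("col0\n1\n2\n0%\n"))  # integer column with a '0%' typo
>>> df
  col0
0    1
1    2
2   0%
>>> df['col0'].dtype
dtype('O')
```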
If we now select the first element of column `col0`, its value will be:

```python
>>> df['col0'][0]
'1'
```

and its type will be:

```python
>>> type(df['col0'][0])
<class 'str'>
```
So even the pandas function `infer_dtype` does not recognize that column as a `mixed` column:

```python
>>> pd.api.types.infer_dtype(df['col0'])
'string'
```
In conclusion, if a column of a CSV file contains at least one value that cannot be interpreted consistently with the type of all the other values (e.g. a numerical value containing a typo), every value of that column will be interpreted as a string.
My proposal is to add a function, called inside the `Dataset.__init__` method, that analyzes the columns with `dtype='object'`. For each of these columns, the function should apply the pandas function `pd.to_numeric(errors='ignore')` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) to each value, converting to float/int every value that can be interpreted as a number while leaving the others untouched. For example, it would turn `'2.1'` and `'2'` into `2.1` and `2`, changing the type of those elements.
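A minimal sketch of the idea is below (the helper name `_convert_mixed_object_columns` is hypothetical, not an existing trousse function). One detail worth noting: calling `pd.to_numeric(column, errors='ignore')` on the whole Series returns the Series completely unchanged as soon as a single value fails to parse, so to obtain the per-value behavior described above the conversion has to be applied element-wise:

```python
import pandas as pd

def _convert_mixed_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: in every dtype='object' column, convert each
    value that parses as a number to int/float, leaving the others as str."""
    for col in df.select_dtypes(include="object").columns:
        # pd.to_numeric on a single value returns the parsed number when the
        # value is numeric-like and, with errors='ignore', the value itself,
        # untouched, when it is not. (errors='ignore' is deprecated in recent
        # pandas; a try/except around pd.to_numeric(x) is equivalent.)
        df[col] = df[col].map(lambda x: pd.to_numeric(x, errors="ignore"))
    return df
```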
This would mean that when the `Dataset` method `_columns_type` calls the `pd.api.types.infer_dtype` function, the inferred type will no longer be `'string'` but a mixed one instead, so that the proper and expected inference will be performed.
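Continuing the reconstruction from above, the effect on the inferred type would look like this (the exact label depends on the mix of values: integers plus strings give `'mixed-integer'`, floats plus strings give `'mixed'`):

```python
>>> pd.api.types.infer_dtype(df['col0'])
'string'
>>> df = _convert_mixed_object_columns(df)
>>> pd.api.types.infer_dtype(df['col0'])   # column now holds 1, 2 and '0%'
'mixed-integer'
```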