Closed tudor-turcu closed 1 month ago
Hello @tudor-turcu. Thank you for the detailed explanation and the unit-test code you provided. You are absolutely right about the proposed root cause. I fixed the issue; please see https://github.com/dotnet/machinelearning/pull/7242
@luisquintanilla, @JakeRadMSFT could you please review and merge my PR?
Thanks!
System Information (please complete the following information):
Describe the bug
DataFrame.LoadCsv() or LoadCsvFromString() infers the wrong type for a column when renameDuplicatedColumns = true and dataTypes is null or empty.
To Reproduce
Call DataFrame.LoadCsv() or LoadCsvFromString() with renameDuplicatedColumns = true and dataTypes = null or empty, with CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US
If a column contains a valid date value in one row, and a subsequent row contains two or more empty string values, one of which appears in a previous column, the column containing the date is inferred as single/float rather than as a date type. Parsing the date values in that column then fails with an exception. The same issue probably affects boolean columns and other types.
Sample CSV:
Expected behavior
The column type should be inferred as a date type, even if the column contains a few empty string values.
Screenshots, Code, Sample Projects
Additional context
No crash occurs if renameDuplicatedColumns is false, if dataTypes is set (no column type guessing), or if CultureInfo.CurrentCulture uses a different DateTime format (like dd.mm.yyyy). In the last case the column is not considered DateTime, but string.
Possible root cause: This seems to be a bug in Microsoft.Data.Analysis.DataFrame. In the ReadCsvLinesIntoDataFrame() function, when the renameDuplicatedColumns parameter is true, not only are the duplicated column names renamed, but 'duplicated' row values are 'renamed' as well, e.g. "345", "345.1", "345.2"... Several empty string values become "", ".1", ".2", ".3". The code that guesses the column types then treats these former empty strings as float/single values under the en-US culture and marks the entire column as having a single/float data type, even though no float values really exist in that column.
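To make the suspected mechanism concrete, here is a minimal sketch in Python. It is not the library's code: `rename_duplicates` and `guess_kind` are hypothetical helpers that only imitate the behavior described above (suffixing repeated values with ".1", ".2", ... and then guessing a type per cell), and the `%m/%d/%Y` date format is an assumption standing in for en-US parsing.

```python
from datetime import datetime


def rename_duplicates(values):
    """Hypothetical stand-in for the renaming step: the first occurrence of a
    value is kept, later occurrences get ".1", ".2", ... appended."""
    seen = {}
    renamed = []
    for value in values:
        count = seen.get(value, 0)
        renamed.append(value if count == 0 else f"{value}.{count}")
        seen[value] = count + 1
    return renamed


def guess_kind(value):
    """Hypothetical per-cell type guess: empty, float, date, or string."""
    if value == "":
        return "empty"
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    try:
        # Assumed date format, chosen only for this illustration.
        datetime.strptime(value, "%m/%d/%Y")
        return "date"
    except ValueError:
        return "string"


# A data row with three empty cells; the last one sits in the date column.
row = ["abc", "", "", ""]
renamed_row = rename_duplicates(row)
kinds = [guess_kind(cell) for cell in renamed_row]

print(renamed_row)
print(kinds)
```

Applied to column names, this renaming is harmless. Applied to a data row, every empty cell after the first stops being empty and starts to look numeric, which matches the symptom reported here: a date column whose empty cell was rewritten to something like ".1" now appears to hold a float.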