dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.88k forks source link

DataFrame when parsing a CSV incorrectly detects a column type when renameDuplicatedColumns is true #7240

Closed tudor-turcu closed 1 month ago

tudor-turcu commented 2 months ago

System Information (please complete the following information):

Describe the bug DataFrame.LoadCsv() or LoadCsvFromString() incorrectly detects a column type when renameDuplicatedColumns = true and dataTypes = null or empty.

To Reproduce Call DataFrame.LoadCsv() or LoadCsvFromString() with renameDuplicatedColumns = true and dataTypes = null or empty, with CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US If a column in the CSV contains on a row a valid date value, and in a subsequent row appear two or more empty string values, one of which appears in a previous column, the column containing a date is not considered as having a date type, but a single/float type. Parsing the date values on that column will fail with an exception. Probably the same issue appears for boolean columns or other types. Sample CSV:

"Col1","Col2","Col3","Col4"
"v1.1","5/7/2017","v3.1","v4.1"
"","","v3.2","v4.2"

Expected behavior The column type should considered to be Date type, even if it contains a few empty string values.

Screenshots, Code, Sample Projects

using Microsoft.Data.Analysis;
using System.Globalization;

namespace DataFrameColumnTypeDetectionBug
{
    internal class Program
    {
        static void Main(string[] args)
        {
            CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US

            string csv =
@"""Col1"",""Col2"",""Col3"",""Col4""
""v1.1"",""5/7/2017"",""v3.1"",""v4.1""
"""","""",""v3.2"",""v4.2""
";

            var dataFrame = DataFrame.LoadCsvFromString(csv, separator: ',', header: true, 
                dataTypes: null, // guess the column types
                renameDuplicatedColumns: true, // try to rename the duplicated columns, if any
                cultureInfo: CultureInfo.InvariantCulture);

            // when renameDuplicatedColumns: true will crash with:
            // System.FormatException: The input string '5/7/2017' was not in a correct format.
            // at System.Number.ThrowFormatException[TChar](ReadOnlySpan`1 value)
            //at System.String.System.IConvertible.ToSingle(IFormatProvider provider)
            //at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)
            //at Microsoft.Data.Analysis.DataFrame.Append(IEnumerable`1 row, Boolean inPlace, CultureInfo cultureInfo)
            //at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(...)

            // no crash if renameDuplicatedColumns: false
            // or if dataTypes is set (no 'column type guessing')
            // or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string
        }
    }
}

Additional context No crash if renameDuplicatedColumns: false or if dataTypes is set (no 'column type guessing') or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string.

Possible root cause: This seems to be a bug in Microosft.Data.Analysis.DataFrame - in ReadCsvLinesIntoDataFrame() function, only if renameDuplicatedColumns param is true, not only the duplicated column names are renamed, but also 'duplicated' row values are 'renamed': ex.: "345", "345.1", "345.2"... Several empty string values become: "", ".1", ".2", ".3". The code that tries to guess the column types will consider these former empty strings as float/single values under en-US culture, marking the entire column a having a single/float data type, even if no float values really exist on that column.

asmirnov82 commented 2 months ago

Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see https://github.com/dotnet/machinelearning/pull/7242

@luisquintanilla, @JakeRadMSFT could you please review and merge my PR?

tudor-turcu commented 2 months ago

Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see #7242

Thanks!