ModelBuilder fails to use a CSV dataset that contains commas inside a quoted field on the first line

antoniovs1029 commented 4 years ago

As part of the "Fix TextLoader" task in https://github.com/dotnet/machinelearning-modelbuilder/issues/702, I was looking into improving ML.NET's TextLoader support for CSV, and I opened this PR: https://github.com/dotnet/machinelearning/pull/5125. As explained there, ML.NET already supports loading regular CSV files; the only thing it couldn't load was newline characters inside quoted fields (which the PR fixes).

So, of the datasets mentioned in #702 under "Fix TextLoader", it turns out ML.NET couldn't open some of them because of those newlines inside quoted fields: some of the Airbnb datasets and some of the jigsaw datasets.
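
To illustrate what "newlines inside quoted fields" means in practice, here is a minimal sketch in Python. It uses Python's standard `csv` module, not ML.NET's TextLoader, and the sample data is made up for the example. It only shows why counting physical lines as records breaks on this kind of file, while a quote-aware parser keeps the record whole.

```python
import csv
import io

# Hypothetical sample: the first data record has a newline inside a quoted field.
SAMPLE = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Treating every physical line as one record splits the quoted field in two.
physical_lines = SAMPLE.splitlines()

# A quote-aware parser keeps the embedded newline inside a single field.
records = list(csv.reader(io.StringIO(SAMPLE, newline="")))

print(len(physical_lines))  # number of physical lines, header included
print(len(records))         # number of parsed records, header included
print(records[1])           # the record that spans two physical lines
```

The two counts differ because the quoted field spans two physical lines, which is the case the PR addresses on the ML.NET side.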

On the other hand, ML.NET was able to load all the other datasets even without the fix in my PR, i.e.:

- The remaining datasets from Airbnb (including the one from https://github.com/dotnet/machinelearning-modelbuilder/issues/327) and jigsaw
- ~~text-emotion~~ (EDIT: I can actually load this file in ML.NET AND in ModelBuilder, so I don't know why this was included in #702).
- sentiment140
- titanic train.csv (from https://github.com/dotnet/machinelearning-modelbuilder/issues/452)

Despite this, ModelBuilder is unable to use them and fails with the following error message:

Unrecognized data format. Please check the input file to make sure it is a valid comma or tab separated file
   at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetCorrectDelimiter(String selectedFileName)
   at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetListOfColumns(String selectedFileName)
   at Microsoft.ML.ModelBuilder.ToolWindows.DataTabDataContext.GetDataLoadDimensions()
   at Microsoft.ML.ModelBuilder.ToolWindows.TextDataControl.SelectFileButton_Click(Object sender, RoutedEventArgs e)

These datasets don't include newlines inside quoted fields, so this is a separate issue.

After experimenting, I realized that if I delete the commas inside quoted fields only in the first data line (the one right after the header), ModelBuilder is able to load the file and work with it, even if other lines still have this kind of comma.

After getting the output code from ModelBuilder, I ran the training code using the original datasets (without deleting any commas), and it all worked fine in ML.NET. This worked even without the changes in my PR, which means the problem has never been in ML.NET's TextLoader.

I guess the problem is in ModelBuilder (or perhaps in AutoML.NET?), somewhere the format of the file is checked by looking only at the first row, and it makes the mistake of treating commas inside quoted fields as invalid.
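
To make that guess concrete, here is a minimal Python sketch of how a first-row check could go wrong. This is not ModelBuilder's or AutoML's actual code (I haven't looked at it); the function names and sample rows are hypothetical. It assumes the check compares the column count of the header with that of the first data row, and contrasts splitting on every comma with a quote-aware split.

```python
import csv
import io

# Hypothetical sample: only the first data row has a comma inside a quoted field.
SAMPLE = 'id,name,price\n1,"Cozy loft, downtown",120\n2,Small studio,80\n'


def naive_column_counts(text, delimiter=","):
    """Count columns per line by splitting on every delimiter, ignoring quotes."""
    return [len(line.split(delimiter)) for line in text.splitlines()]


def quote_aware_column_counts(text, delimiter=","):
    """Count columns per record with a parser that honors quoted fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


def first_row_matches_header(counts):
    """Assumed check: compare the header with the first data row only."""
    return counts[0] == counts[1]


naive = naive_column_counts(SAMPLE)
aware = quote_aware_column_counts(SAMPLE)
print(naive, first_row_matches_header(naive))
print(aware, first_row_matches_header(aware))
```

With the naive split, the first data row appears to have one more column than the header, so a check like this would reject the comma as the delimiter. That would also be consistent with what I saw: removing the quoted commas from just that first row is enough for the file to be accepted.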

Please let me know if there are still reasons to believe that this is a problem outside ModelBuilder/AutoML.NET (and perhaps particularly in TextLoader), so that I can look into it asap. Thanks! 😄

LittleLittleCloud commented 4 years ago

It's because ModelBuilder is still pinned to an older version of AutoML.NET, so your fix isn't actually used here. The error should be resolved once we upgrade to the latest ML.NET, after ML.NET 1.5 is released.

antoniovs1029 commented 4 years ago

Hi @LittleLittleCloud. What I'm trying to say is that the fix in my PR https://github.com/dotnet/machinelearning/pull/5125 won't fix this issue in ModelBuilder (I haven't verified this, though, as I wouldn't know how to build ModelBuilder and test it).

My reasoning is that I can load all of those datasets in ML.NET even without the changes in my PR, so TextLoader has never been the problem behind these issues. I can also load them in ModelBuilder by deleting the commas inside quoted fields, whereas my PR doesn't change anything in that respect, because TextLoader has always supported commas inside quoted fields.

antoniovs1029 commented 4 years ago

After discussing this offline with @LittleLittleCloud :

- We agreed that we are able to load text-emotion.csv with ModelBuilder without any problem, so we aren't sure why it was included in the datasets on #702.
- We agreed that the issue with the remaining datasets might be in the InferColumns() AutoML method (and, in particular, I insist it isn't in ML.NET's TextLoader).

So having people from the ModelBuilder team look into the AutoML code would be helpful, since I'm not very familiar with it. Also, since I am prioritizing work on TextLoader-related issues, please let me know if you find a specific reason to think this is actually an issue in TextLoader. Thanks.

LittleLittleCloud commented 4 years ago

A few words to add:

I removed the "text-emotion.csv" and "titanic.csv" files from #702, since the dataset parsing error for those datasets is not caused by ML.NET/AutoML and has been fixed in the latest ModelBuilder. The remaining datasets (jigsaw and Airbnb listing.csv) all contain newlines between quotes and can cause a dataset loading error on the latest ModelBuilder; that error should be fixed once @antoniovs1029's fix is in.

antoniovs1029 commented 4 years ago

I'll close this issue, since @LittleLittleCloud already reported that ModelBuilder didn't have any problems with the datasets I mentioned, and the other datasets (which had newline characters inside quoted fields) were addressed in https://github.com/dotnet/machinelearning/pull/5125.

Thanks. 😄

ModelBuilder fails to use a CSV dataset that contains commas inside a quoted field on the first line #747