dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

LoadFromTextFile unable to correctly parse CSV file #4367

Closed aslotte closed 4 years ago

aslotte commented 4 years ago

System information

Issue

Source code / logs

aslotte commented 4 years ago

Selecting the same file in Model Builder causes Visual Studio 2019 to crash as well. It appears to be an issue with reading the .csv file, although Excel is able to parse it.

aslotte commented 4 years ago

Same issue occurs when loading this dataset: https://data.world/promptcloud/product-details-on-flipkart-com

I first thought it had something to do with commas in the columns, but I've since ruled that out. The issue seems to be narrowed down to the description column: if that column is removed, the dataset loads correctly. Could the line breaks in that column be causing the issue?
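To illustrate the line-break hypothesis, here is a minimal Python sketch, independent of ML.NET. The sample data and function names are made up for illustration. It shows how a quoted field containing a line break yields a different record count depending on whether the reader honors quoting:

```python
import csv
import io

# Hypothetical sample (not taken from the real dataset): the description
# field of the first record contains a line break inside double quotes.
SAMPLE = 'name,description\n"Shirt","Soft cotton.\nMachine washable."\n"Hat","Wool"\n'


def count_physical_lines(text):
    """Count lines the way a reader that ignores quoting would see them."""
    return len(text.splitlines())


def count_csv_records(text):
    """Count records using a quote-aware CSV parser."""
    return sum(1 for _ in csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print("physical lines:", count_physical_lines(SAMPLE))
    print("CSV records:", count_csv_records(SAMPLE))
```

If the two counts differ for a file, at least one field spans multiple lines, which would be consistent with the behavior described above.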

aslotte commented 4 years ago

This is the error I see in the event logs. I'm running this on a different computer from the one I used when I first saw this issue (hence the Framework version). I would expect to see the same error on my other computer, though.

Application: devenv.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: Microsoft.ML.AutoML.InferenceException
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferSplit(Microsoft.ML.MLContext, Microsoft.ML.AutoML.TextFileSample, System.Nullable`1<Char>, System.Nullable`1<Boolean>, System.Nullable`1<Boolean>)
   at Microsoft.ML.AutoML.ColumnInferenceApi.InferColumns(Microsoft.ML.MLContext, System.String, UInt32, Boolean, System.Nullable`1<Char>, System.Nullable`1<Boolean>, System.Nullable`1<Boolean>, Boolean, Boolean)
   at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetCorrectDelimiter(System.String)
   at Microsoft.ML.ModelBuilder.DataSources.FileDataSource.GetListOfColumns(System.String)
   at Microsoft.ML.ModelBuilder.ToolWindows.TrainModelDataContext.GetTotalFileColumn()
   at Microsoft.ML.ModelBuilder.TrainModelToolWindowControl.GetDataLoadDimensions()
   at Microsoft.ML.ModelBuilder.TrainModelToolWindowControl.SelectFileButton_Click(System.Object, System.Windows.RoutedEventArgs)
   at System.Windows.RoutedEventHandlerInfo.InvokeHandler(System.Object, System.Windows.RoutedEventArgs)
   at System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)
   at System.Windows.UIElement.RaiseEventImpl(System.Windows.DependencyObject, System.Windows.RoutedEventArgs)
   at System.Windows.UIElement.RaiseEvent(System.Windows.RoutedEventArgs)
   at System.Windows.Controls.Primitives.ButtonBase.OnClick()
   at System.Windows.Controls.Button.OnClick()
   at System.Windows.Controls.Primitives.ButtonBase.OnMouseLeftButtonUp(System.Windows.Input.MouseButtonEventArgs)
   at System.Windows.UIElement.OnMouseLeftButtonUpThunk(System.Object, System.Windows.Input.MouseButtonEventArgs)
   at System.Windows.Input.MouseButtonEventArgs.InvokeEventHandler(System.Delegate, System.Object)
   at System.Windows.RoutedEventArgs.InvokeHandler(System.Delegate, System.Object)
   at System.Windows.RoutedEventHandlerInfo.InvokeHandler(System.Object, System.Windows.RoutedEventArgs)
   at System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)
   at System.Windows.UIElement.ReRaiseEventAs(System.Windows.DependencyObject, System.Windows.RoutedEventArgs, System.Windows.RoutedEvent)
   at System.Windows.UIElement.OnMouseUpThunk(System.Object, System.Windows.Input.MouseButtonEventArgs)
   at System.Windows.Input.MouseButtonEventArgs.InvokeEventHandler(System.Delegate, System.Object)
   at System.Windows.RoutedEventArgs.InvokeHandler(System.Delegate, System.Object)
   at System.Windows.RoutedEventHandlerInfo.InvokeHandler(System.Object, System.Windows.RoutedEventArgs)
   at System.Windows.EventRoute.InvokeHandlersImpl(System.Object, System.Windows.RoutedEventArgs, Boolean)
   at System.Windows.UIElement.RaiseEventImpl(System.Windows.DependencyObject, System.Windows.RoutedEventArgs)
   at System.Windows.UIElement.RaiseTrustedEvent(System.Windows.RoutedEventArgs)
   at System.Windows.UIElement.RaiseEvent(System.Windows.RoutedEventArgs, Boolean)
   at System.Windows.Input.InputManager.ProcessStagingArea()
   at System.Windows.Input.InputManager.ProcessInput(System.Windows.Input.InputEventArgs)
   at System.Windows.Input.InputProviderSite.ReportInput(System.Windows.Input.InputReport)
   at System.Windows.Interop.HwndMouseInputProvider.ReportInput(IntPtr, System.Windows.Input.InputMode, Int32, System.Windows.Input.RawMouseActions, Int32, Int32, Int32)
   at System.Windows.Interop.HwndMouseInputProvider.FilterMessage(IntPtr, MS.Internal.Interop.WindowMessage, IntPtr, IntPtr, Boolean ByRef)
   at System.Windows.Interop.HwndSource.InputFilterMessage(IntPtr, Int32, IntPtr, IntPtr, Boolean ByRef)
   at MS.Win32.HwndWrapper.WndProc(IntPtr, Int32, IntPtr, IntPtr, Boolean ByRef)
   at MS.Win32.HwndSubclass.DispatcherCallbackOperation(System.Object)
   at System.Windows.Threading.ExceptionWrapper.InternalRealCall(System.Delegate, System.Object, Int32)
   at System.Windows.Threading.ExceptionWrapper.TryCatchWhen(System.Object, System.Delegate, System.Object, Int32, System.Delegate)
   at System.Windows.Threading.Dispatcher.LegacyInvokeImpl(System.Windows.Threading.DispatcherPriority, System.TimeSpan, System.Delegate, System.Object, Int32)
   at MS.Win32.HwndSubclass.SubclassWndProc(IntPtr, Int32, IntPtr, IntPtr)

aslotte commented 4 years ago

If I modify the description column in Excel by doing the following, it works:

  1. Using the CLEAN function to remove non-printable characters
  2. Wrapping the cell content in double quotes (" ")

This is obviously not a long-term solution, but maybe we would at least like to include similar cleaning of non-printable characters when inferring columns?
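The two Excel steps above can be approximated outside Excel. The following Python sketch is an illustration only, not part of ML.NET, and its function names are hypothetical. It drops characters with code points below 32 (roughly what CLEAN removes, including line breaks and tabs) and rewrites every cell wrapped in double quotes:

```python
import csv
import io


def clean_cell(value):
    """Drop control characters (code points 0-31), including line breaks."""
    return "".join(ch for ch in value if ord(ch) >= 32)


def clean_csv(text):
    """Re-emit CSV text with every cell cleaned and wrapped in double quotes."""
    rows = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical input: one cell with a line break, one with a BEL character.
    print(clean_csv('a,b\n"x\ny","z\x07"\n'))
```

Note that removing a line break joins the surrounding text without inserting a space, so a preprocessing step like this changes the cell content and may not suit every dataset.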

gvashishtha commented 4 years ago

@najeeb-kazmi Do you know if this is expected behavior?

mstfbl commented 4 years ago

With the "product-details-on-flipkart-com" dataset @aslotte linked, I get the following Model Builder error:

[Image: Capture]

mstfbl commented 4 years ago

Hey @aslotte, the first dataset you mentioned (https://data.world/promptcloud/fashion-products-on-amazon-com/workspace/file?filename=amazon_co-ecommerce_sample.csv) loads fine in ML.NET Model Builder on my end. However, I am also having issues with the second dataset. In addition, I am not able to import the Flipkart dataset even after removing the "description" column as you suggested.

mstfbl commented 4 years ago

Hi @aslotte, I believe I figured out the problem here. The Flipkart dataset you linked has empty values for certain fields, such as the "retail price" and "discounted price" columns on line 13. This results in an error when a field that may be expecting a Boolean or an int sees an empty string instead.
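One way to locate such empty fields before loading is a small standalone check. This is a sketch only: the column names `retail_price` and `discounted_price` are assumed spellings, and the function name is hypothetical.

```python
import csv
import io


def find_empty_cells(text, columns):
    """Return (line_number, column) pairs where a named column is empty."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    hits = []
    for row in reader:
        for col in columns:
            # Rows shorter than the header yield None for the missing columns.
            if (row.get(col) or "").strip() == "":
                # line_num is the last physical line of the current record.
                hits.append((reader.line_num, col))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: product B has no prices.
    sample = "product,retail_price,discounted_price\nA,100,80\nB,,\n"
    print(find_empty_cells(sample, ["retail_price", "discounted_price"]))
```

Rows reported by a check like this are candidates for the type-inference failure described above, although an empty value alone does not prove it is the cause.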

The exception you have been receiving ("Exception Info: Microsoft.ML.AutoML.InferenceException") occurs, according to InferenceException.cs, "... when AutoML is inferring the data type of a column." This means that AutoML was having trouble deciding what type (int, string, boolean, etc.) a given column was.

In the code I've written in TestCSVLoad.cs in this branch of mine, I manually load the flipkart dataset you mentioned. This is the class I've made for my IDataView:

https://github.com/mstfbl/machinelearning/blob/e41ad460c0384220bb91b7f797728e7f33817c07/docs/samples/Microsoft.ML.Samples/Dynamic/DataOperations/TestCSVLoad.cs#L34-L80

Currently, all of the data types here are strings. When I change the types of variables that should obviously be ints (such as RetailPrice and DiscountedPrice) or boolean (IsFkAdvantagedProduct), I obtain the following error:

[Image: Capture]

I believe that the error you are having and the exception I get are due to the same cause, which is misaligned columns for LoadFromTextFile.
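A quick way to test the misalignment hypothesis is to compare each record's field count with the header's. The sketch below is a generic, quote-aware check in Python with a hypothetical function name. It does not reproduce how LoadFromTextFile parses the file.

```python
import csv
import io


def find_misaligned_rows(text, delimiter=","):
    """Return (line_number, field_count) for records whose width differs from the header."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # Empty input: nothing to compare against.
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # line_num is the last physical line of the current record.
            bad.append((reader.line_num, len(row)))
    return bad


if __name__ == "__main__":
    # Hypothetical sample: one short row and one long row.
    print(find_misaligned_rows("a,b,c\n1,2,3\n4,5\n6,7,8,9\n"))
```

If a quote-aware parser like this reports no misaligned records but the loader still fails, the loader may be splitting on delimiters or line breaks inside quoted fields.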