"Unable to parse file" exception when trying to open CSV file with many columns

andrasfuchs commented 2 years ago

System Information (please complete the following information):

Model Builder Version: 16.9.3.2206002
Visual Studio Version: 17.0.5

Describe the bug

On the Data tab I browsed for and selected a fairly big .csv file with 42626 columns. The following exception was thrown after a few seconds and the file was not loaded:

Unable to parse file. Only comma, tab or semi-colon delimited files are allowed.
   at StreamJsonRpc.JsonRpc.<InvokeCoreAsync>d__139`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.Actions.DataTextActions.<GetCsvDataProgramAsync>d__11.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.Actions.DataTextActions.<SetFilePathAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.ToolWindows.DataTextViewModel.<SetFilePathAsync>d__108.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.ToolWindows.TextDataControl.<<BrowseFileButton_Click>b__5_0>d.MoveNext()

To Reproduce

Steps to reproduce the behavior:

1. Go to your project in Solution Explorer
2. Click on "Add...", "Machine Learning Model"
3. Select "Data classification", "CPU" to get to the Data tab
4. Browse for a big file; mine can be downloaded from the link under Additional context
5. See the parsing error

Expected behavior

I would expect to load the file so that I can start the training.

Screenshots

Additional context

I also had another file with 36537 columns and it was loaded fine. The file I had the exception with can be downloaded from here: https://drive.google.com/open?id=1DKRabjaiCQ91ydFD35w-VvUDNHOUHreb

I also tried files with even more columns and they failed too.
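For anyone who wants to reproduce this without the linked file, a synthetic wide CSV may be enough. The following is a minimal sketch, assuming the failure depends on the column count rather than on the specific values (the thread does not confirm this). The function names, column names, and the `Label` column are hypothetical.

```python
import csv
import io
import random


def write_wide_csv(stream, total_columns, n_rows=10, seed=0):
    """Write a comma-delimited CSV with `total_columns` columns.

    The last column is a 0/1 label; all others hold random floats.
    """
    if total_columns < 2:
        raise ValueError("need at least one feature column and one label column")
    rng = random.Random(seed)
    writer = csv.writer(stream, lineterminator="\n")
    n_features = total_columns - 1
    # Header row: c0, c1, ..., then the label column.
    writer.writerow([f"c{i}" for i in range(n_features)] + ["Label"])
    for _ in range(n_rows):
        features = [f"{rng.random():.6f}" for _ in range(n_features)]
        writer.writerow(features + [rng.randint(0, 1)])


def count_columns(stream):
    """Return the number of fields in the first row of a CSV stream."""
    return len(next(csv.reader(stream)))


# Example: build the two widths mentioned in the report in memory.
for width in (36537, 42626):
    buffer = io.StringIO()
    write_wide_csv(buffer, width, n_rows=2)
    buffer.seek(0)
    print(width, count_columns(buffer))
```

To write to disk instead, pass a file opened with `open("wide.csv", "w", newline="")` as `stream`.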

beccamc commented 2 years ago

Thanks for reporting @andrasfuchs! I'll take a look.

andrasfuchs commented 2 years ago

No problem at all, thank you for taking a look at it.

Is there any other way I can help the development? I reported a few issues both here and in the machine learning repo, and I could work on those and do pull requests with the fixes. Do you see any chance that you open the source code for the model builder as well?

I'm kind of blocked at the moment with my project because of the bugs I reported, so I could spend my time on fixing them instead of waiting ;)

beccamc commented 2 years ago

@JakeRadMSFT to comment on if/when we are going open source.

@andrasfuchs I'll try and fix this one this week. Which other issues are blocking you?

andrasfuchs commented 2 years ago

@beccamc Thank you! This is the other one.

andrasfuchs commented 2 years ago

@beccamc Did you have the chance to look into this or the other issue last week?

andrasfuchs commented 2 years ago

@beccamc Could you please consider making Model Builder open source? It's really frustrating that I run into bugs again and again, and I can't fix them, and I'm not able to make workarounds. I had two new exceptions today, but I don't see any value in reporting them any more.

I expected some feedback from you (and hopefully from @JakeRadMSFT too), since you said in this thread more than 5 weeks ago that you would look into this issue. I'm working hard on workarounds everywhere I can, but I'm still blocked with some of my work.

As I said before, I would invest my time into fixing some of the bugs in Model Builder and I would gladly make pull requests to help your work if it were open source.

JakeRadMSFT commented 2 years ago

@andrasfuchs apologies for the delay.

You're hitting up against a few of our weak points here in the tooling that we're hoping to resolve over the coming months. We know data consumption and big data consumption are something we need to help provide a solution for.

We have a couple of things in the works that should help here:

- We're bringing Model Builder's AutoML to the framework (which you already know is open source) - this should hopefully make it so you're not blocked on us :).
- We're also working on a Big Data story:
  - We have the .NET Notebook experience - https://marketplace.visualstudio.com/items?itemName=MLNET.notebook
  - We have the .NET Data Frame - https://docs.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe
  - We have Spark .NET - https://dotnet.microsoft.com/en-us/apps/data/spark
- We're also working on improving our UI experiences when there are a lot of columns ... but I think there will always be some point at which it becomes painful to use.

Right now it's a little disconnected but we're hoping to bring all these things together to solve the problem you're hitting.

I do have a question for you ...

Do you feel like you want/need a UI based solution for your data with that many columns? Would a code-first AutoML approach in a notebook meet your needs?

I'd love your feedback here: https://github.com/dotnet/machinelearning/pull/6118

andrasfuchs commented 2 years ago

@JakeRadMSFT Thank you for your update and suggestions.

.NET Notebook might not be my perfect fit, because I work with real-time data coming from a hardware device as a stream. That stream contains 4 million 32-bit float data points every 5 seconds that I feed into one or more trained models.

.NET Data Frame looks like a better match for my scenario, but I haven't worked with it yet. I can see that its documentation indicates ML version 0.19. Is that correct, and is it publicly available for me to test?

Model Builder's UI challenges with many columns are not that critical for me anymore. I created my own .mbconfig file generator as a workaround, and it works okay.

Regarding your question: UI is secondary for me if the foundations are working as expected and I have the ability to circumvent the limitations of the UI. I love DataRobot and its nice, intuitive GUI, but if I had to choose between a stable, reliable ML.NET handling my (probably extreme) data load and DataRobot, I would definitely prefer ML.NET.

If AutoML's code-first approach solved the issues that block me (like #6035, #1975, #1986), and I could run it in a local environment, then sure, it would be excellent!

I would like to share a little more information about my project, not because I expect you to work towards my requirements, but to show you guys my use case that might help you understand my challenges. I'm working on a hardware device that detects mental states and health problems by analyzing the electromagnetic radiation measured on our forearms. I have 4 million FFT data points, but I reduced them into 20 thousand data-point buckets to do the training (ML.NET started to throw exceptions when I tried to have more). I have ~10 float labels and I expect them to reach ~100 in the next few months. I use regression training on my models regularly as I collect more data from my patients, and I also apply all those trained models on real-time data coming from my detector/sensor. I love how fast ML.NET models make predictions by the way!
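The reduction from 4 million FFT points to 20 thousand buckets described above could look like the following. This is only a sketch: the comment does not say which reduction was used, so averaging contiguous bins (200 points per bucket at those sizes) is an assumption, and `bucket_means` is a hypothetical name.

```python
from statistics import fmean


def bucket_means(values, n_buckets):
    """Mean-pool a sequence into `n_buckets` contiguous, near-equal buckets."""
    n = len(values)
    if not 0 < n_buckets <= n:
        raise ValueError("n_buckets must be between 1 and len(values)")
    buckets = []
    for b in range(n_buckets):
        # Integer boundaries keep every bucket non-empty even when
        # n is not an exact multiple of n_buckets.
        start = b * n // n_buckets
        end = (b + 1) * n // n_buckets
        buckets.append(fmean(values[start:end]))
    return buckets


# Small illustration; the real case would be 4_000_000 points and 20_000 buckets.
print(bucket_means([1, 2, 3, 4, 5, 6], 3))
```

Averaging keeps each bucket on the same scale as the raw bins; summing or taking the maximum per bucket are equally plausible choices, depending on what the model should see.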

Handling huge CSVs (#1217, #1986), multi-process CPU training (#1219), regression training on multiple labels, fast input loading (#1975) and generalized input and output handling among models (#1973) would all benefit my project, but I'm already fairly satisfied with ML.NET and the results I'm getting, and I'm grateful for the work you all do.

Let me know if I can help your work with data, source code or bug fixes.

beccamc commented 2 years ago

The specific csv issue reported here should be fixed by https://github.com/dotnet/machinelearning-tools/pull/1432

beccamc commented 2 years ago

Fix has been released. The problem here was a timeout in the package we use to predict column type. I increased the timeout. If it works intermittently we may need to look at making that timeout value configurable. Let me know!
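A configurable timeout around column-type inference, as floated above, might be shaped like this. It is a minimal sketch and not Model Builder's implementation: the thread does not name the package involved, so `infer_column_types` is a toy stand-in and `infer_with_timeout` is a hypothetical wrapper.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def infer_column_types(header, rows):
    """Toy inference: a column is 'float' if every value parses, else 'string'."""
    types = {}
    for index, name in enumerate(header):
        kind = "float"
        for row in rows:
            try:
                float(row[index])
            except ValueError:
                kind = "string"
                break
        types[name] = kind
    return types


def infer_with_timeout(header, rows, timeout_seconds, infer=infer_column_types):
    """Run `infer` in a worker thread and give up after `timeout_seconds`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(infer, header, rows)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # Raise a dedicated timeout error so callers can tell it apart
        # from a genuinely malformed file.
        raise TimeoutError(
            f"column type inference exceeded {timeout_seconds}s"
        ) from None
    finally:
        # Do not block on the worker; it keeps running until it finishes.
        pool.shutdown(wait=False)


print(infer_with_timeout(["a", "b"], [["1", "x"], ["2.5", "y"]], timeout_seconds=5.0))
```

Exposing `timeout_seconds` as a setting would let users with very wide files raise it themselves, and a distinct timeout error would avoid pointing them at the delimiter when the file is actually well-formed.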

andrasfuchs commented 2 years ago

Model Builder Version: v16.13.6.2226201
Visual Studio Version: Enterprise 2022 v17.2.1
Microsoft.ML package: v1.7

I can confirm that a file with 42'630 columns was loaded in 18 seconds without any issues.

Clicking on the Column to predict (Label) dropdown caused Visual Studio to freeze and I was unable to select any columns.

dotnet / machinelearning-modelbuilder

"Unable to parse file" exception when trying to open CSV file with many columns #1986