Closed andrasfuchs closed 2 years ago
Thanks for reporting @andrasfuchs! I'll take a look.
No problem at all, thank you for taking a look at it.
Is there any other way I can help the development? I reported a few issues both here and in the machine learning repo, and I could work on those and do pull requests with the fixes. Do you see any chance that you open the source code for the model builder as well?
I'm kind of blocked at the moment with my project because of the bugs I reported, so I could spend my time on fixing them instead of waiting ;)
@JakeRadMSFT to comment on if/when we are going open source.
@andrasfuchs I'll try and fix this one this week. Which other issues are blocking you?
@beccamc Thank you! This is the other one.
@beccamc Did you have the chance to look into this or the other issue last week?
@beccamc Could you please consider making Model Builder open source? It's really frustrating that I run into bugs again and again, and I can't fix them, and I'm not able to make workarounds. I had two new exceptions today, but I don't see any value in reporting them any more.
I expected some feedback from you (and hopefully from @JakeRadMSFT too), since you said in this thread more than 5 weeks ago that you would look into this issue. I'm working hard on workarounds everywhere I can, but I'm still blocked with some of my work.
As I said before, I would invest my time into fixing some of the bugs in Model Builder and I would gladly make pull requests to help your work if it were open source.
@andrasfuchs apologies for the delay.
You're hitting up against a few of our weak points here in the tooling that we're hoping to resolve over the coming months. We know data consumption and big data consumption is something we need to help provide a solution for.
We have a couple of things in the works that should help here -
Right now it's a little disconnected but we're hoping to bring all these things together to solve the problem you're hitting.
I do have a question for you ...
Do you feel like you want/need a UI based solution for your data with that many columns? Would a code-first AutoML approach in a notebook meet your needs?
I'd love your feedback here: https://github.com/dotnet/machinelearning/pull/6118
@JakeRadMSFT Thank you for your update and suggestions.
.NET Notebook might not be my perfect fit, because I work with real-time data coming from a hardware device as a stream. That stream contains 4 million 32-bit float data points every 5 seconds that I feed into one or more trained models.
.NET Data Frame looks like a better match for my scenario, but I haven't worked with it yet. I can see that its documentation indicates ML version 0.19. Is that correct, and is it publicly available for me to test?
Model Builder's UI challenges with many columns are not that critical for me anymore. I created my own .mbconfig file generator as a workaround, and it works okay.
Regarding your question: UI is secondary for me if the foundations are working as expected and I have the ability to circumvent the limitations of the UI. I love DataRobot and its nice, intuitive GUI, but if I had to choose between a stable, reliable ML.NET handling my (probably extreme) data load and DataRobot, I would definitely prefer ML.NET.
If AutoML's code-first approach solved the issues that block me (like #6035, #1975, #1986), and I could run it in a local environment, then sure, it would be excellent!
I would like to share a little more information about my project, not because I expect you to work towards my requirements, but to show you guys my use case that might help you understand my challenges. I'm working on a hardware device that detects mental states and health problems by analyzing the electromagnetic radiation measured on our forearms. I have 4 million FFT data points, but I reduced them into 20 thousand data point-buckets to do the training (ML.NET started to throw exceptions when I tried to have more). I have ~10 float labels and I expect them to reach ~100 in the next few months. I use regression training on my models regularly as I collect more data from my patients, and I also apply all those trained models on real-time data coming from my detector/sensor. I love how fast ML.NET models make predictions by the way!
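The reduction from 4 million FFT points to 20 thousand buckets mentioned above could look roughly like the sketch below. The thread does not say how the buckets were computed, so averaging over contiguous, near-equal slices is an assumption, and `bucket_means` is a hypothetical helper name. It is written in plain Python for illustration and is not ML.NET code.

```python
from statistics import fmean


def bucket_means(values, bucket_count):
    """Reduce `values` to `bucket_count` averages over contiguous slices.

    Assumption: each bucket is the arithmetic mean of its slice. The
    original project may use a different aggregate (max, sum, ...).
    """
    n = len(values)
    if bucket_count <= 0 or bucket_count > n:
        raise ValueError("bucket_count must be between 1 and len(values)")
    reduced = []
    for b in range(bucket_count):
        # Integer arithmetic spreads any remainder evenly across buckets.
        start = b * n // bucket_count
        end = (b + 1) * n // bucket_count
        reduced.append(fmean(values[start:end]))
    return reduced


# Small stand-in for a 4,000,000-point spectrum: 12 points into 3 buckets.
spectrum = [float(i) for i in range(12)]
features = bucket_means(spectrum, 3)
print(features)  # [1.5, 5.5, 9.5]
```

With 4,000,000 input points and 20,000 buckets, each bucket would cover 200 points, so every training row ends up with 20,000 feature columns plus the label columns.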
Handling huge CSVs (#1217, #1986), multi-process CPU training (#1219), regression training on multiple labels, fast input loading (#1975) and generalized input and output handling among models (#1973) would all benefit my project, but I'm already fairly satisfied with ML.NET and the results I'm getting, and I'm grateful for the work you all do.
Let me know if I can help your work with data, source code or bug fixes.
The specific csv issue reported here should be fixed by https://github.com/dotnet/machinelearning-tools/pull/1432
Fix has been released. The problem here was a timeout in the package we use to predict column type. I increased the timeout. If it works intermittently we may need to look at making that timeout value configurable. Let me know!
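For illustration only, a configurable timeout around column type prediction might be sketched as follows. This is not the code of Model Builder or of the package it uses: the function name, the `timeout_seconds` parameter and the fallback to a string column are all assumptions made for the example.

```python
import time


def infer_column_type(samples, timeout_seconds=5.0, clock=time.monotonic):
    """Guess 'float' or 'string' for a column within a time budget.

    Hypothetical sketch: `timeout_seconds` is the configurable value;
    `clock` is injectable so the timeout path can be tested.
    """
    deadline = clock() + timeout_seconds
    for value in samples:
        if clock() > deadline:
            # Budget exhausted: fall back to the safest type instead of raising.
            return "string"
        try:
            float(value)
        except ValueError:
            # One non-numeric cell is enough to treat the column as text.
            return "string"
    return "float"


print(infer_column_type(["1.0", "2.5", "3"]))  # float
print(infer_column_type(["1.0", "abc"]))       # string
```

Falling back to a default type when the budget runs out means a slow inference step degrades the result instead of failing the whole file load. Whether that trade-off suits very wide files is a separate design question.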
Model Builder Version: v16.13.6.2226201
Visual Studio Version: Enterprise 2022 v17.2.1
Microsoft.ML package: v1.7
I can confirm that a file with 42'630 columns was loaded in 18 seconds without any issues.
Clicking on the Column to predict (Label) dropdown caused Visual Studio to freeze and I was unable to select any columns.
System Information (please complete the following information):
Describe the bug
On the Data tab I browsed for and selected a fairly big .csv file with 42626 columns. The following exception was thrown after a few seconds and the file was not loaded:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I would expect to load the file so that I can start the training.
Screenshots
Additional context
I also had another file with 36537 columns and it was loaded fine. The file I had the exception with can be downloaded from here: https://drive.google.com/open?id=1DKRabjaiCQ91ydFD35w-VvUDNHOUHreb
I also tried files with even more columns and they failed too.