dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Max File Size Limits for Model Builder #3680

Closed danucalovj closed 5 years ago

danucalovj commented 5 years ago

Issue:

Using the ML.NET Model Builder, there is a max file size limit of 1 GB per the GUI:

[Screenshot: Model Builder GUI showing the 1 GB max file size limit]

To my knowledge, there is no documentation on this limit. If there is, can someone provide a link?

Is there a way to change the limit?

And, is this a limit also using the AutoML CLI?

justinormont commented 5 years ago

@danucalovj: I've run the CLI on 1 TB datasets (Criteo 1TB); there is no enforced limit in the CLI. You will want to set `--cache off` if AutoML begins to run out of memory. The CLI is also missing the ability to disable certain trainers; talking to the AutoML API directly gives you that ability. In the future, we'd like to handle memory management automatically for the user.

Currently there are three interfaces to AutoML in ML.NET: Model Builder GUI, mlnet CLI, and AutoML API; each more configurable than the previous, though generally less simple to use.
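For reference, a CLI run with caching disabled (as suggested above for large datasets) might look like the sketch below. This assumes the 2019-era `mlnet auto-train` command; the dataset path, label column name, and exploration time are hypothetical placeholders, and flag names may differ in later CLI versions, so check `mlnet auto-train --help` for your install.

```shell
# Sketch: train a binary classifier on a large dataset with the in-memory
# cache disabled, so AutoML streams from disk instead of loading everything
# into RAM. File name, label column, and time budget are placeholders.
mlnet auto-train \
  --task binary-classification \
  --dataset flights-2007-2008.csv \
  --label-column-name Delayed \
  --max-exploration-time 3600 \
  --cache off
```

The trade-off is speed for memory: with the cache off, each trainer re-reads the data from disk, which is slower but keeps memory usage roughly flat regardless of dataset size.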

@rustd: Is the Model Builder's 1GB limitation just to ensure a smooth use (keeping the required runtime in check)? Is the limit a recommendation or is it enforced?

rustd commented 5 years ago

This is the first preview of Model Builder, so we wanted to ensure a smooth experience and hence put a limit of 1 GB. There are no technical limitations; the limit is only temporary. We will try to provide a way to override these limits. I apologize for the lack of documentation; we are working on putting out more content. You can see the existing docs at https://aka.ms/modelbuilderdocs /cc @JakeRadMSFT

danucalovj commented 5 years ago

Thank you for the clarification @justinormont and @rustd

I'll start using the CLI from now on for datasets > 1 GB.

justinormont commented 5 years ago

@danucalovj: Quite welcome of course.

rustd commented 5 years ago

@danucalovj out of curiosity can you please describe your use case and dataset characteristics?

danucalovj commented 5 years ago

Hi @rustd

Absolutely! I've been working a lot with Keras + TensorFlow, and this is one example I've been porting to ML.NET (a GLM):

https://github.com/danucalovj/GLMPerf

... Prediction of flight delays based on flight data from:

http://stat-computing.org/dataexpo/2009/the-data.html

Goal: binary classification (0 for not delayed, 1 for delayed).

Each year of data is roughly 600-800 MB, so for now I'm working with 2007 and 2008 joined into one dataset of ~1.5 GB, a few million rows total. I can provide the exact row count tomorrow, as I'm not in front of my computer right now.

I'm doing the data cleanup, column selection/exclusion, and feature engineering separately, of course, since that capability isn't built into Model Builder.

Let me know if you need any other details.

danucalovj commented 5 years ago

As a side note, for learning purposes I spend some time looking at problems on Kaggle. If you look through the competitions, you'll notice most of the datasets are fairly large as well. I'd probably use Kaggle as a reference for the types of ML tasks and datasets people work with. I hope that helps.

rustd commented 5 years ago

Thank you for the information. This is very useful. Feature engineering is also on our roadmap for the future.