dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Make grid search of parameter space more efficient #512

Open mjmckp opened 6 years ago

mjmckp commented 6 years ago

The ML.NET library suffers from a lack of decoupling between data preparation and model training, which is required to do an efficient grid search over training parameters.

That is, ideally the API should be structured in such a way that it is possible to do the following:

  1. Prepare the data set once, so that it can be re-used multiple times. As much as possible, any pre-training calculations should be done up front (or perhaps cached for re-use). For large data sets, the overhead of repeating this step each time is significant, taking as long as or longer than the training itself.
  2. For algorithms with multiple training iterations, it should be straightforward to retain the intermediate trained models at each iteration (or at a specified set of iterations). This way, it is easy to compute metrics for the intermediate models on training and validation data sets, and ultimately select one of the intermediate models for use in production without having to re-run the training. (A usage sketch follows this list.)
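
For illustration, here is a rough sketch of the calling pattern these two points would enable. The PrepareData and TrainWithPreparedData methods, the TrainingOptions type, and the Evaluate helper are all hypothetical and are used only to show the shape of the API:

        // Hypothetical API sketch: prepare the data once, then sweep over parameters re-using it.
        var prepared = trainer.PrepareData(trainingData);   // single pass: binning, caching, etc.

        foreach (var learningRate in new[] { 0.05, 0.1, 0.2 })
        {
            var options = new TrainingOptions { LearningRate = learningRate, NumIterations = 500 };

            // Intermediate models are retained so each can be scored on validation data
            // without re-training from scratch.
            var checkpoints = trainer.TrainWithPreparedData(prepared, options);
            var best = checkpoints.OrderByDescending(m => Evaluate(m, validationData)).First();
        }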

For example, consider training a LightGBM model. This is the training method in LightGbmTrainerBase.cs:

        public void Train(RoleMappedData data)
        {
            Dataset dtrain;
            CategoricalMetaData catMetaData;
            using (var ch = Host.Start("Loading data for LightGBM"))
            {
                using (var pch = Host.StartProgressChannel("Loading data for LightGBM"))
                    dtrain = LoadTrainingData(ch, data, out catMetaData);
                ch.Done();
            }
            using (var ch = Host.Start("Training with LightGBM"))
            {
                using (var pch = Host.StartProgressChannel("Training with LightGBM"))
                    TrainCore(ch, pch, dtrain, catMetaData);
                ch.Done();
            }
            dtrain.Dispose();
            DisposeParallelTraining();
        }

In order to address point 1) above, the dtrain object returned by LoadTrainingData should be available for re-use. This would require that the configuration parameters for data preparation be specified separately from those for training, rather than all being lumped together in the LightGbmArguments type.
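
For example, Train could be split into two methods along the following lines. This is only a rough sketch against the code shown above, not existing ML.NET code; PrepareData and TrainOnPreparedData are hypothetical names, and the data-prep parameters would live in a separate arguments type:

        // Rough sketch (not existing ML.NET code): split data prep from training so the
        // prepared Dataset can be re-used across many training runs.
        public Dataset PrepareData(RoleMappedData data, out CategoricalMetaData catMetaData)
        {
            Dataset dtrain;
            using (var ch = Host.Start("Loading data for LightGBM"))
            {
                using (var pch = Host.StartProgressChannel("Loading data for LightGBM"))
                    dtrain = LoadTrainingData(ch, data, out catMetaData);
                ch.Done();
            }
            // The caller owns the Dataset and disposes it after the last training run.
            return dtrain;
        }

        public void TrainOnPreparedData(Dataset dtrain, CategoricalMetaData catMetaData)
        {
            using (var ch = Host.Start("Training with LightGBM"))
            {
                using (var pch = Host.StartProgressChannel("Training with LightGBM"))
                    TrainCore(ch, pch, dtrain, catMetaData);
                ch.Done();
            }
            DisposeParallelTraining();
        }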

Now, regarding point 2) above, note that the TrainCore method calls WrappedLightGBMTraining.Train, which has the following structure:

        public static Booster Train(IChannel ch, IProgressChannel pch,
            Dictionary<string, object> parameters, Dataset dtrain, Dataset dvalid = null, int numIteration = 100,
            bool verboseEval = true, int earlyStoppingRound = 0)
        {
            // create Booster.
            Booster bst = new Booster(parameters, dtrain, dvalid);

            for (int iter = 0; iter < numIteration; ++iter)
            {
                // training logic
            }
            return bst;
        }

In order to get the intermediate models, this method should return Booster[] instead of just the final Booster (or perhaps, in this case, the Booster object should support extracting a prediction model that contains only the first N trees of the ensemble).
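
As a rough illustration of the first option, the loop could retain snapshots at chosen iterations. This is hypothetical code: the checkpointIterations parameter does not exist today, and SnapshotBooster stands in for whatever mechanism would copy or serialize the Booster's current state:

        // Hypothetical variant of Train that retains intermediate models at selected iterations.
        public static List<Booster> TrainWithCheckpoints(IChannel ch, IProgressChannel pch,
            Dictionary<string, object> parameters, Dataset dtrain, ISet<int> checkpointIterations,
            Dataset dvalid = null, int numIteration = 100,
            bool verboseEval = true, int earlyStoppingRound = 0)
        {
            var checkpoints = new List<Booster>();
            Booster bst = new Booster(parameters, dtrain, dvalid);

            for (int iter = 0; iter < numIteration; ++iter)
            {
                // ... same per-iteration training logic as today ...

                // Keep a snapshot at the requested iterations so metrics can later be computed
                // on training/validation data without re-running the training.
                if (checkpointIterations.Contains(iter))
                    checkpoints.Add(SnapshotBooster(bst));
            }

            checkpoints.Add(bst);
            return checkpoints;
        }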

Perhaps there is already the facility to do this in ML.Net, but I'm unable to find anything from my reading of the source or any of the examples.

I think 99.9% of all machine learning research requires doing a parameter grid search at some stage, and hence this is essential functionality that should be as efficient as possible.

mjmckp commented 6 years ago

@TomFinley Interested to hear your thoughts on this issue, please?

Ivanidzo4ka commented 6 years ago

DRI RESPONSE: Let me poke @TomFinley one more time. Also worth including @Zruty0, who is working on our new API as well.

TomFinley commented 6 years ago

Sorry, I did not catch this until it was pointed out to me by @Ivanidzo4ka -- I have to imagine that there's a good way to get alerts like this beyond the little "bell" in the toolbar, but if there is I'm unaware of it. 😛

So the observation, as I understand it, is that many algorithms have a dataset preprocessing step that produces an intermediate format. While we should retain the more general API, why not have a separate API alongside it, where we factor out the dataset prep into the algorithm-specific layout, and the training that consumes that algorithm-specific layout?

Of course, there's no reason why we couldn't, but there are a couple of practical reasons why we haven't yet...

We didn't do it previously because what we now call ML.NET was a tool, not an API -- which is to say, it had a command line, a GUI, and some other interfaces, but what API it had was secondary -- so if we had factored out the code in that fashion, it would have been utterly useless to that effort (which was, again, previously the focus), since there would have been no "hook" for it in the CLI/GUI (which treats the components it operates on as black boxes).

Now that we are trying to make this an API, we could certainly do it, and it would be valuable to do so... however, again, because we're trying to make this an API, the focus (at least among those people working at Microsoft) is mostly on trying to make sure that this API is well posed in general, rather than on optimizing specific scenarios yet. It doesn't matter how fast or good we make some scenarios if we've just shipped a fundamentally broken, and hence not useful, API for v1.0 (which is where 99% of our efforts in the preview have gone so far).

So for example: we could ask @codemzs to do this, but on the other hand he was working on, say, fundamental infrastructure like #705 . Or we could ask @sfilipi to do this, but that means that her work on making IEstimators/ITransformers consistent and well presented in MLContext doesn't get done. Both things are important, but the work that is getting done is, I'd argue, more urgent.

mjmckp commented 6 years ago

Thanks @TomFinley, yes, we need to be able to do two things efficiently: (1) factor out the data preparation step (which requires a clear distinction between the configuration parameters that apply to the data prep stage and those that apply to the training stage), and (2) retain intermediate versions of models during training (for models that are trained iteratively).

These are fundamental features required to run parameter grid searches efficiently (which is the most time-consuming part of working as an ML practitioner); without them, ML.NET cannot be used for anything but toy datasets. I would really like to see ML.NET succeed (we are primarily a .NET shop), but unfortunately we cannot bring it into our toolset until this issue is addressed.

TomFinley commented 6 years ago

Well, let's break it down. What proportion of the time do you observe in dataset prep?

Zruty0 commented 6 years ago

I agree, there are missing 'extension/configuration points' in ML.NET, and we don't give the user full visibility into what every individual learner is doing. For the first version, our focus is to provide a single, consistent framework for all the data preparation steps and all the learning algorithms. That inevitably comes with generalization, and with hiding some of the specific properties of each algorithm.

I do not agree, however, that the absence of these features fundamentally disqualifies ML.NET as a 'real' machine learning framework and downgrades it to 'toy datasets only'.

As @TomFinley stated, we are not currently working on this, and instead focus on the things that we deem more important for all customers. Let's capture these asks as issues and put them on our backlog, to be picked up when the time is right.

Alternatively, if you guys feel strongly that this is an essential feature and a deal-breaker for your scenario, please feel free to fork the code and make the changes. We will happily review the code and get it in :)

mjmckp commented 5 years ago

Well, of course it depends on the algorithm, the size of the dataset, and so on. However, I raised an issue about this a while ago (https://github.com/dotnet/machinelearning/issues/256) showing that for the ML.NET implementation of GBDTs, the data preparation stage was taking three times as long as the training stage for a medium-size data set. LightGBM has a similar data preparation stage (the creation of the native Dataset object), in which it does a full pass over the data, sorting each column, in order to build the bins for each feature.
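
To give a sense of why that pass is expensive, here is a simplified, self-contained illustration of the kind of per-feature binning work involved. This is not ML.NET or LightGBM code, just a sketch of the general histogram-binning technique:

        // Simplified illustration of histogram binning: one full scan per feature to pick
        // bin boundaries, plus another pass to map raw values to bin indices. It is this
        // binned representation that a sweep could re-use instead of rebuilding it each time.
        static int[] BinFeature(double[] values, int maxBins)
        {
            var sorted = (double[])values.Clone();
            Array.Sort(sorted);    // the expensive per-column sort

            // Approximately equal-frequency bin boundaries taken from the sorted values.
            var boundaries = new double[maxBins - 1];
            for (int b = 1; b < maxBins; b++)
                boundaries[b - 1] = sorted[(int)((long)b * sorted.Length / maxBins)];

            // Map each raw value to its bin index.
            var bins = new int[values.Length];
            for (int i = 0; i < values.Length; i++)
            {
                int idx = Array.BinarySearch(boundaries, values[i]);
                bins[i] = idx >= 0 ? idx : ~idx;
            }
            return bins;
        }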

Zruty0 commented 5 years ago

It seems that we are in agreement here: we could get a massive efficiency boost for sweeping scenarios if we could separate and checkpoint the 'data prep' phase from the 'training' phase inside FastTree / LightGBM.

And by 'data prep' here I mean 'scanning the training data and building in-memory bin representation'.

Absolutely, we should do this. We cannot really do this right now though.

codemzs commented 5 years ago

This issue has been brought up internally as well, in the context of reusing the data prep step for multiple sweeps over the same data.