elastic / ml-cpp

Machine learning C++ code

[ML] Train models within the given memory limit #1299

Open valeriy42 opened 4 years ago

valeriy42 commented 4 years ago

In the current workflow, a DFA job fails if its memory requirement exceeds the specified limit; the user then has to update the job settings and restart the job from the last checkpoint. This workflow supports users who want to get the best possible results out of our algorithms. However, as a starting experience it may be undesirable: a new user just wants the job to run without a (memory) failure.

In order to improve the user experience, I propose to train models within the specified memory limit. This means we keep splitting nodes and adding new trees in the gradient boosting loop until we hit the memory limit, and then treat the resulting model as converged.
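
For concreteness, here is a minimal, self-contained sketch of how such a stopping criterion could sit in the boosting loop. This is not the actual ml-cpp code; `Tree`, `estimateTreeMemory` and `trainNextTree` are hypothetical placeholders.

```cpp
#include <cstddef>
#include <vector>

struct Tree {
    std::size_t numberNodes;
};

// Rough per-tree memory estimate: the nodes dominate the forest's footprint.
// The 64 bytes-per-node figure is an illustrative assumption.
std::size_t estimateTreeMemory(const Tree& tree, std::size_t bytesPerNode = 64) {
    return tree.numberNodes * bytesPerNode;
}

Tree trainNextTree() {
    // Placeholder for one boosting round (split nodes, fit leaf values).
    return Tree{31};
}

std::vector<Tree> trainWithinMemoryLimit(std::size_t memoryLimitBytes,
                                         std::size_t maximumTrees) {
    std::vector<Tree> forest;
    std::size_t used = 0;
    for (std::size_t i = 0; i < maximumTrees; ++i) {
        Tree candidate = trainNextTree();
        std::size_t next = used + estimateTreeMemory(candidate);
        if (next > memoryLimitBytes) {
            break; // Hit the limit: treat the current forest as converged.
        }
        forest.push_back(candidate);
        used = next;
    }
    return forest;
}

int main() {
    auto forest = trainWithinMemoryLimit(/*memoryLimitBytes=*/16 * 1024,
                                         /*maximumTrees=*/1000);
    return forest.empty() ? 1 : 0;
}
```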

valeriy42 commented 4 years ago

I would argue that this should be the default workflow for dealing with the memory limit. In general, for realistically large problems we can always get slightly better results by throwing more resources (compute time and memory) at them. Failing when the memory limit is exceeded should be an option for advanced users.

WDYT @sophiec20 @stevedodson @tveasey ?

valeriy42 commented 4 years ago

@tveasey I guess that if you follow this approach and then restart the job several times with increasing memory limits, you are effectively getting a hyperband approach! Hence, we could suggest it as a "recipe" usage pattern instead of implementing it explicitly.
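
A rough sketch of what that recipe could look like, assuming a hypothetical `restartFromCheckpoint` helper (not an ml-cpp API) that resumes the job from its last checkpoint under a given limit and reports whether the results are good enough:

```cpp
#include <cstddef>

// Placeholder: resume the DFA job from its last checkpoint with the given
// memory limit and report whether the results are good enough. The "good
// enough" test here is a dummy stand-in.
bool restartFromCheckpoint(std::size_t memoryLimitBytes) {
    return memoryLimitBytes >= 64 * 1024 * 1024;
}

void hyperbandLikeSchedule(std::size_t initialLimitBytes, int rounds) {
    std::size_t limit = initialLimitBytes;
    for (int round = 0; round < rounds; ++round) {
        if (restartFromCheckpoint(limit)) {
            break; // Good enough: stop spending more resources.
        }
        limit *= 2; // Give the next restart twice the memory.
    }
}

int main() {
    hyperbandLikeSchedule(/*initialLimitBytes=*/16 * 1024 * 1024, /*rounds=*/4);
    return 0;
}
```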

sophiec20 commented 4 years ago

What is the downside of running within a given limit? Do we get worse results or worse runtime or something else?

Surely there is a minimum amount of memory needed? Is this the case? Presumably this still means we have to handle the case where it would be best to fail due to insufficient memory.

valeriy42 commented 4 years ago

There is an absolute minimum of memory we need to load the data frame. On top of that, what the limit determines is basically the number of nodes in the forest you can afford. Bluntly speaking, more nodes means higher accuracy, but there are diminishing returns from allowing more nodes.
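
As a back-of-the-envelope illustration (hypothetical names and numbers, not the actual memory estimator), the memory model is roughly a fixed floor for the data frame plus a per-node cost, so a fixed limit translates into a node budget:

```cpp
#include <cstddef>

// Hypothetical memory model: a fixed floor for holding the data frame plus a
// per-node cost for the forest.
std::size_t estimateJobMemory(std::size_t dataFrameBytes,
                              std::size_t totalForestNodes,
                              std::size_t bytesPerNode) {
    return dataFrameBytes + totalForestNodes * bytesPerNode;
}

int main() {
    // The affordable node budget is whatever is left after the data frame
    // floor, divided by the per-node cost (all figures illustrative).
    std::size_t limit = 512 * 1024 * 1024;
    std::size_t dataFrame = 200 * 1024 * 1024;
    std::size_t bytesPerNode = 64;
    std::size_t nodeBudget = (limit - dataFrame) / bytesPerNode;
    return nodeBudget > 0 ? 0 : 1;
}
```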

tveasey commented 4 years ago

I think this is a very good proposal.

As we've discussed offline, there is reason to believe that our overall training strategy should perform rather well subject to this sort of stopping criterion; we will even naturally explore the hyperparameter settings most suitable given the constraint. It will also speed up training, so it provides a potential route to let people achieve good enough results faster (short of having a training time budget).

I don't think this functionality is required for GA, but it is definitely something I think we should implement at some point.

tveasey commented 3 years ago

Adding another thought. Currently, we cache aggregate gain derivatives on the nodes during training (the bulk of their memory usage) so that we only need to compute derivatives for the child with fewer rows when estimating split gains (we use the fact that parent(agg derivs) = left_child(agg derivs) + right_child(agg derivs)). If we are running memory constrained, we could allow the user to disable this runtime optimisation and train with much lower memory.
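
A minimal illustration of the sibling trick described above (hypothetical types, not the ml-cpp implementation): because aggregate gradients and curvatures are additive, only the smaller child needs to be accumulated from its rows and the larger child follows by subtraction from the cached parent. Dropping the cache saves the per-node storage at the cost of accumulating both children from their rows.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Loss derivatives aggregated over a set of rows. Gradient and curvature are
// additive, which is what makes parent = left child + right child hold.
struct AggregateDerivatives {
    double gradient;
    double curvature;

    AggregateDerivatives& operator+=(const AggregateDerivatives& rhs) {
        gradient += rhs.gradient;
        curvature += rhs.curvature;
        return *this;
    }
    AggregateDerivatives operator-(const AggregateDerivatives& rhs) const {
        return AggregateDerivatives{gradient - rhs.gradient, curvature - rhs.curvature};
    }
};

// The expensive step: scan rows and accumulate their derivatives.
AggregateDerivatives accumulate(const std::vector<AggregateDerivatives>& rowDerivatives,
                                const std::vector<std::size_t>& rows) {
    AggregateDerivatives result{0.0, 0.0};
    for (std::size_t row : rows) {
        result += rowDerivatives[row];
    }
    return result;
}

// With the per-node cache: scan only the smaller child and obtain the larger
// child's aggregates by subtracting from the cached parent aggregates.
std::pair<AggregateDerivatives, AggregateDerivatives>
childDerivatives(const AggregateDerivatives& cachedParent,
                 const std::vector<AggregateDerivatives>& rowDerivatives,
                 const std::vector<std::size_t>& smallerChildRows) {
    AggregateDerivatives smaller = accumulate(rowDerivatives, smallerChildRows);
    return {smaller, cachedParent - smaller};
}

int main() {
    std::vector<AggregateDerivatives> rows{{1.0, 2.0}, {0.5, 1.0}, {2.0, 4.0}};
    AggregateDerivatives parent = accumulate(rows, {0, 1, 2});
    auto children = childDerivatives(parent, rows, /*smallerChildRows=*/{1});
    // Larger child's gradient recovered by subtraction: 3.5 - 0.5 = 3.0.
    return children.second.gradient == 3.0 ? 0 : 1;
}
```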