mjmckp opened 6 years ago
@TomFinley Interested to hear your thoughts on this issue, please?
DRI RESPONSE: Let me poke @TomFinley one more time. It's also worth including @Zruty0, who is working on our new API as well.
Sorry, I did not catch this until it was pointed out to me by @Ivanidzo4ka -- I have to imagine that there's a good way to get alerts like this beyond the little "bell" in the toolbar, but if there is I'm unaware of it. 😛
So the observation, as I understand it, is that many algorithms have a dataset preprocessing step, which has an intermediate format. While we should retain the more general API, why not have a separate API alongside that, where we factor out the dataset prep into the algorithm-specific layout, and the training using that algorithm-specific layout?
Of course, there's no reason why we couldn't, but there are a couple practical reasons why we haven't yet...
We didn't do it previously because what we now call ML.NET was a tool, not an API -- which is to say, it had a command line, it had a GUI, and some other interfaces, but whatever API it had was secondary. If we had factored out the code in that fashion, it would have been utterly useless to that effort (which was, again, previously the focus), since there would have been no "hook" for it in the CLI/GUI (which treat the components they operate on as black boxes).
Now that we are trying to make this an API, we could certainly do it, and it would be valuable to do so. However, precisely because we're trying to make this an API, the focus (at least among those of us working at Microsoft) is mostly on making sure that the API is well posed in general, rather than on optimizing specific scenarios yet. It doesn't matter how fast or good we make some scenarios if we ship a fundamentally broken API for v1.0 -- which is why making the API sound has been the focus of 99% of our efforts in the preview so far.
So for example: we could ask @codemzs to do this, but on the other hand he was working on, say, fundamental infrastructure like #705. Or we could ask @sfilipi to do this, but that means that her work on making `IEstimator`s/`ITransformer`s consistent and well presented in `MLContext` doesn't get done. Both things are important, but the work that is getting done is, I'd argue, more urgent.
Thanks @TomFinley, yes, we need to be able to do two things efficiently: (1) factor out the data preparation step (which requires a clear distinction between the configuration parameters that apply to the data prep stage and those that apply to the training stage), and (2) retain intermediate versions of models during training (for models that are trained iteratively).
These are fundamental features required to efficiently run parameter grid searches (the most time-consuming part of working as an ML practitioner); without them, ML.NET cannot be used for anything but toy datasets. I would really like to see ML.NET succeed (as we are primarily a .NET shop), but unfortunately we cannot bring it into our toolset until this issue is addressed.
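To make the ask concrete, here is a rough sketch of the pattern we are after (plain illustrative Python, not ML.NET's API -- every name here is made up): the prep parameters are kept separate from the training parameters, the expensive prep pass runs once, and each point in the sweep reuses the prepared representation while keeping per-iteration model checkpoints.

```python
import itertools

def prepare(raw_rows, n_bins):
    # Hypothetical data-prep step: bin each feature once (the expensive pass).
    # In a real GBDT this would sort each column and build histogram bins.
    return [tuple(min(int(v * n_bins), n_bins - 1) for v in row) for row in raw_rows]

def train(prepared, learning_rate, num_trees):
    # Hypothetical iterative trainer that yields a model after every iteration,
    # so intermediate models are retained (requirement 2).
    model = []
    for i in range(num_trees):
        model.append(("tree", i, learning_rate))  # stand-in for a fitted tree
        yield list(model)                         # intermediate model checkpoint

raw = [(0.1, 0.9), (0.5, 0.4), (0.99, 0.0)]
prepared = prepare(raw, n_bins=16)               # paid once ...

results = {}
for lr, nt in itertools.product([0.05, 0.1], [3, 5]):
    # ... then reused for every point in the sweep (requirement 1).
    checkpoints = list(train(prepared, learning_rate=lr, num_trees=nt))
    results[(lr, nt)] = checkpoints[-1]

print(len(results))  # 4 sweep points, one data-prep pass
```

The point of the sketch is only the shape of the API: prep configuration and training configuration never travel in the same bag of arguments.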
Well, let's break it down. What proportion of the time do you observe in dataset prep?
I agree, there are missing 'extension/configuration points' in ML.NET, and we don't give the user full visibility into what every individual learner is doing. For the first version, our focus is to provide a single, consistent framework for all the data preparation steps and all the learning algorithms. That inevitably comes with generalization, and with hiding some of the specific properties of each algorithm.
I do not agree that the absence of these features fundamentally disqualifies ML.NET as a 'real' machine learning framework and downgrades it to 'toy datasets only'.
As @TomFinley stated, we are not currently working on this, and instead focus on the things that we deem more important for all customers. Let's capture these asks as issues and put them on our backlog, to be picked up when the time is right.
Alternatively, if you guys are feeling so strongly that this is an essential feature and a deal-breaker to your scenario, please feel free to fork the code and make the changes. We will happily review the code and get it in :)
Well, of course it depends on the algorithm, the size of the dataset, etc. However, I raised an issue about this a while ago (https://github.com/dotnet/machinelearning/issues/256) showing that, for the ML.NET implementation of GBDTs, the data preparation stage was taking 3 times as long as the training stage on a medium-sized data set. LightGBM has a similar data preparation stage (the creation of the native `DataSet` object) in which it does a full pass over the data, sorting each column in order to build the bins for each feature.
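For readers unfamiliar with that prep step, here is a deliberately simplified sketch (plain Python, not LightGBM's actual implementation) of what histogram-based bin construction involves -- sorting each feature column and picking quantile cut points:

```python
from bisect import bisect_right

def bin_boundaries(column, n_bins):
    # Simplified quantile binning: sort the column once and take evenly
    # spaced cut points. Real LightGBM bin construction is more elaborate
    # (sparse handling, distinct-value merging), but the cost profile is
    # similar: a full sort-and-scan of every feature column.
    ordered = sorted(column)
    n = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        q = ordered[min(k * n // n_bins, n - 1)]
        if not cuts or q > cuts[-1]:   # deduplicate boundaries
            cuts.append(q)
    return cuts

def to_bins(column, cuts):
    # Map raw values to bin indices via the precomputed boundaries (cheap).
    return [bisect_right(cuts, v) for v in column]

col = [0.3, 0.1, 0.9, 0.5, 0.7, 0.2, 0.8, 0.4]
cuts = bin_boundaries(col, n_bins=4)
print(cuts, to_bins(col, cuts))  # [0.3, 0.5, 0.8] [1, 0, 3, 2, 2, 0, 3, 1]
```

The sort-and-scan (`bin_boundaries`) is the part worth doing once and reusing; only the cheap `to_bins` mapping depends on nothing a parameter sweep would change.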
It seems that we are in agreement here: we could get a massive efficiency boost for sweeping scenarios if we could separate and checkpoint the 'data prep' phase from the 'training' phase inside FastTree / LightGBM.
And by 'data prep' here I mean 'scanning the training data and building in-memory bin representation'.
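One way to picture the 'separate and checkpoint' idea (illustrative Python only -- ML.NET exposes no such API today) is to key a cached binned representation on the raw data plus the prep parameters, so that changing a training parameter never re-triggers the prep pass:

```python
import hashlib, json, os, pickle, tempfile

def prep_cache_key(data_path, prep_params):
    # The cache key covers the raw data identity and the prep configuration,
    # but *not* the training parameters -- that separation is the whole point.
    blob = json.dumps({"path": data_path, "prep": prep_params}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def load_or_build_bins(data_path, prep_params, build_fn, cache_dir):
    key = prep_cache_key(data_path, prep_params)
    cache_file = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)           # checkpoint hit: skip the prep pass
    binned = build_fn(data_path, prep_params)  # the expensive pass
    with open(cache_file, "wb") as f:
        pickle.dump(binned, f)
    return binned

# Usage: sweeping learning rates reuses one checkpointed binned dataset.
with tempfile.TemporaryDirectory() as d:
    calls = []
    build = lambda path, prep: calls.append(path) or {"bins": [1, 2, 3]}
    for lr in (0.05, 0.1, 0.2):
        binned = load_or_build_bins("train.csv", {"max_bins": 255}, build, d)
    print(len(calls))  # built once despite three sweep points
```

This is only a sketch of the bookkeeping; the hard part inside FastTree / LightGBM is making the in-memory bin representation serializable in the first place.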
Absolutely, we should do this. We cannot really do this right now though.
This issue has been brought up internally as well, where we could reuse the data prep step for multiple sweeps over the same data.
The ML.Net library suffers from a lack of decoupling between data preparation and model training, required to do an efficient grid search over training parameters.
That is, ideally the API should be structured in such a way that it is possible to do the following: (1) re-use the prepared, algorithm-specific representation of the training data across multiple training runs, and (2) obtain the intermediate models produced during iterative training.
For example, consider training a LightGBM model via the training method in `LightGbmTrainerBase.cs`. In order to address point 1) above, the `dtrain` object returned by `LoadTrainingData` should be available to be re-used. This would require that the configuration parameters for data preparation be specified separately from those for training, instead of all being thrown in together into the `LightGbmArguments` type.

Now, in regards to point 2) above, note that the `TrainCore` method calls `WrappedLightGBMTraining.Train`, which returns only the final `Booster`. In order to get the intermediate models, that method should instead return a `Booster[]` (or perhaps the `Booster` object should support extraction of a prediction model that contains only the first `N` trees of the ensemble).

Perhaps there is already a facility to do this in ML.Net, but I'm unable to find anything from my reading of the source or any of the examples.
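The 'first `N` trees' alternative is cheap for boosted ensembles because a GBDT prediction is just a sum of per-tree outputs, so every truncated model comes for free from one pass over the trees. A toy Python illustration (not LightGBM's `Booster` API):

```python
def staged_predictions(trees, x):
    # For an additive ensemble, the model truncated to its first N trees is
    # obtained by keeping a running sum of per-tree contributions:
    # stages[N-1] is the prediction of the first N trees.
    total, stages = 0.0, []
    for tree in trees:
        total += tree(x)
        stages.append(total)
    return stages

# Toy 'trees': each contributes a fixed correction to the score.
trees = [lambda x: 0.5, lambda x: 0.25, lambda x: 0.125]
print(staged_predictions(trees, x=None))  # [0.5, 0.75, 0.875]
```

So whether the API returns a `Booster[]` or a single object with truncated-prediction support, the underlying cost of exposing the intermediate models is negligible compared to training itself.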
I think 99.9% of all machine learning research requires doing a parameter grid search at some stage, and hence this is essential functionality that should be as efficient as possible.