RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

How can I retrain a model that has already been trained? #204

Open alitokmakci opened 2 years ago

alitokmakci commented 2 years ago

Hello, I have a PersistentModel wrapping a MultilayerPerceptron that I trained previously. I have another dataset with more than 500,000 rows, and I want to continue training the already-trained model on it. I tried partial training, but it does not work for me. Where is my mistake, or how can I achieve this? Code:

// train.php
$estimator = new PersistentModel(....);

$estimator->train($dataset);

$estimator->save();

// retrain.php
$estimator = PersistentModel::load(new Filesystem('model.rbx'));

$estimator->partial($dataset);

$estimator->save();

When I execute retrain.php with a very small dataset, model.rbx's file size decreases from 35 MB to 25 MB, and its accuracy also drops, from about 0.87 to 0.4. I hope I explained my problem clearly. Thanks for the amazing job!

andrewdalpino commented 2 years ago

Hey @alitokmakci thanks for the feedback. It looks like your code is correct. I've long wondered what issues we could run into with Online learning. We might have to dig a little to figure this one out. I'll give a brief summary of how partial training works with the MLP, and hopefully that may trigger some further discussion.

The neural net subsystem (used by the MLP) is trained with mini-batch Gradient Descent. In addition, most of the pluggable Gradient Descent Optimizers are adaptive, meaning they carry state and use that state to make the next parameter update. The early-stopping mechanism depends on a few things at the end of each epoch - the minimum change in the loss function and whether the model is still improving on the validation set, to name a couple.
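For concreteness, here's a minimal sketch of where those knobs live when constructing the MLP - the layer sizes, batch size, and learning rate below are illustrative values, not recommendations:

use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;
use Rubix\ML\NeuralNet\Optimizers\Adam;

// Hidden layers, mini-batch size, and an adaptive (stateful) optimizer.
$estimator = new MultilayerPerceptron([
    new Dense(100),
    new Activation(new ReLU()),
], 128, new Adam(0.001));

Adam is one of the adaptive optimizers - it carries running moment estimates between updates, which is the state referred to above.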

When partially training, the model will try to minimize the loss on the new dataset. Therefore, if the new dataset does not contain samples like those in the initial training set, the initial samples will effectively be "forgotten" (sometimes called catastrophic forgetting), which might explain the loss in accuracy. I'd be interested to know what the results would be if you combined both the new and old datasets, randomized and split them, and then partially trained the model on each of the resulting chunks - see the sketch below. Would the accuracy scores be fairly similar, or would they still diverge?
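A rough sketch of that experiment (assuming $oldDataset and $newDataset are Labeled datasets you've already loaded; the fold count of 10 is arbitrary):

use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;

// Combine old and new samples, then shuffle so every chunk is representative.
$combined = $oldDataset->merge($newDataset)->randomize();

$estimator = PersistentModel::load(new Filesystem('model.rbx'));

// Partially train on each chunk of the combined dataset in turn.
foreach ($combined->fold(10) as $chunk) {
    $estimator->partial($chunk);
}

$estimator->save();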

As far as the change in file size goes: after initialization, the total number of parameters in the neural network does not change. Training only modifies their values; it does not remove or introduce params. So this result is puzzling - I don't have an explanation for it off the top of my head.

flavio-schoute commented 2 years ago

Is this problem solved? Training a model with new data is an important feature in ML.

andrewdalpino commented 1 year ago

Hey @Snicser, you can use Online Learners to train a model with new data after it has already been trained.

https://docs.rubixml.com/2.0/online.html
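As a rough sketch (assuming $newDataset is a Labeled dataset you've prepared, and that the PersistentModel wrapper exposes its base learner via base()):

use Rubix\ML\Online;
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;

$estimator = PersistentModel::load(new Filesystem('model.rbx'));

// partial() only makes sense when the wrapped learner implements Online.
if ($estimator->base() instanceof Online) {
    $estimator->partial($newDataset);
    $estimator->save();
}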