AI4OPT / ML4OPF

Machine Learning for Optimal Power Flow

Continue training a model that was previously trained #16


mtanneau commented 1 week ago

I used the starter code in the README to train an ACBasicNeuralNet for a few epochs.

from ml4opf import ACProblem

data_path = 'tests/test_data/89_pegase'
problem = ACProblem(data_path)

# make a basic neural network model
from ml4opf.models.basic_nn import ACBasicNeuralNet # requires pytorch-lightning

config = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "loss": "mse",
    "hidden_sizes": [500,300,500],
    "activation": "sigmoid",
    "boundrepair": "sigmoid" # optionally clamp outputs to bounds (choices: "sigmoid", "relu", "clamp")
}

model = ACBasicNeuralNet(config, problem)  # setup ML model

# train ML model
model.train(trainer_kwargs={'max_epochs': 16, 'accelerator': 'auto'})

I then evaluated its performance and, after seeing the results, I would like to train it further, for instance for another 16 epochs.

I tried the following

model.train(trainer_kwargs={'max_epochs': 16, 'accelerator': 'auto'})

which immediately stopped as the max number of epochs was reached.

>>> model.train(trainer_kwargs={'max_epochs': 16, 'accelerator': 'auto'})
.conda/envs/ml4opf/lib/python3.12/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory ML4OPF/lightning_logs/version_1/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type        | Params | Mode 
--------------------------------------------------
0 | violation | ACViolation | 0      | eval 
1 | loss      | MSELoss     | 0      | train
2 | layers    | Sequential  | 437 K  | train
--------------------------------------------------
437 K     Trainable params
0         Non-trainable params
437 K     Total params
1.750     Total estimated model params size (MB)
10        Modules in train mode
1         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=16` reached.

Calling model.train again with a higher number of epochs also terminates immediately

model.train(trainer_kwargs={'max_epochs': 32, 'accelerator': 'auto'})

with the same output

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type        | Params | Mode 
--------------------------------------------------
0 | violation | ACViolation | 0      | eval 
1 | loss      | MSELoss     | 0      | train
2 | layers    | Sequential  | 437 K  | train
--------------------------------------------------
437 K     Trainable params
0         Non-trainable params
437 K     Total params
1.750     Total estimated model params size (MB)
10        Modules in train mode
1         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=16` reached.

Is there a convenient way to continue/restart training without losing the current weights?

klamike commented 1 week ago

I agree it is unfortunate that simply calling BasicNeuralNet.train(...) again doesn't just resume training. Indeed, calling BasicNeuralNet.train(...) with non-empty trainer_kwargs should give a warning when there is an existing trainer, since those kwargs won't be used (currently there is just a debug message): https://github.com/AI4OPT/ML4OPF/blob/a778eb9ee9fb3d1fde4e5c66855cc07e16aa47f0/ml4opf/models/basic_nn/basic_nn.py#L88-L90
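For illustration, here is a rough sketch of what such a warning could look like; this is a hypothetical helper, not current ML4OPF code (which only emits a debug message at that point), and the trainer/trainer_kwargs names are assumptions based on the linked snippet.

import warnings

def _warn_if_trainer_kwargs_ignored(trainer, trainer_kwargs):
    # Hypothetical guard, not the current ML4OPF behavior: warn (instead of
    # only logging at debug level) when an existing trainer means the new
    # trainer_kwargs will be silently ignored.
    if trainer is not None and trainer_kwargs:
        warnings.warn(
            "An existing Trainer was found; the supplied trainer_kwargs will be ignored. "
            "Adjust the existing trainer directly (e.g. its max_epochs) instead.",
            UserWarning,
        )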

Anyway, to solve this problem I typically work around it by updating the trainer's internal max_epochs and then calling BasicNeuralNet.train() again (with no kwargs). See the train(model: BasicNeuralNet, epochs: int) function in tests/test_models.py (specifically line 73): https://github.com/AI4OPT/ML4OPF/blob/a778eb9ee9fb3d1fde4e5c66855cc07e16aa47f0/tests/test_models.py#L65-L86
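As a minimal sketch of that workaround applied to the example above, assuming model.trainer exposes the underlying Lightning Trainer (as in the linked test) and that its fit_loop.max_epochs attribute can be bumped, which holds for recent PyTorch Lightning versions:

# Sketch of the workaround, not an official API: raise the epoch budget on the
# existing Lightning Trainer, then resume training with the current weights.
extra_epochs = 16
model.trainer.fit_loop.max_epochs += extra_epochs  # bump the internal max_epochs
model.train()  # no trainer_kwargs, so the existing trainer is reused and training resumes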

The current training mechanism for BasicNeuralNet is just a very thin wrapper over the PyTorch Lightning Trainer. I wouldn't be opposed to adding nice-to-haves like this to it in the future.