SteveOv / ebop_maven

EBOP Model Automatic input Value Estimation Neural network
GNU General Public License v3.0

Overnight model search stopped training after ~20 evals #73

Closed. SteveOv closed this issue 1 month ago.

SteveOv commented 1 month ago

Maybe an issue with TrainingTimeoutCallback().

SteveOv commented 1 month ago

Can't really see why it's not working but I can force a repro.

Training happily goes beyond eval ~20 if I manage the timeout via a LambdaCallback, so I'm going to go with that approach.
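
For reference, a minimal sketch of that LambdaCallback approach, assuming plain Keras; make_timeout_callback and the names in the usage comment are hypothetical and not the project's actual TrainingTimeoutCallback or model_search code:

```python
import datetime
import tensorflow as tf

def make_timeout_callback(model, max_duration=datetime.timedelta(hours=1)):
    """Build a LambdaCallback that asks `model` to stop training once a
    wall-clock deadline has passed (checked at the end of each epoch)."""
    deadline = datetime.datetime.now() + max_duration

    def check_deadline(epoch, logs):
        if datetime.datetime.now() >= deadline:
            print(f"Epoch {epoch + 1}: deadline reached; stopping training early.")
            model.stop_training = True

    return tf.keras.callbacks.LambdaCallback(on_epoch_end=check_deadline)

# Hypothetical usage within a single trial:
# model.fit(train_ds, validation_data=val_ds, epochs=250, verbose=2,
#           callbacks=[make_timeout_callback(model), early_stopping])
```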

SteveOv commented 1 month ago

OK, it happened again last night, so it's unlikely to be the timeout. Another possibility is a memory issue.

To be clear, when I say training stopped, what actually happens is that models are no longer trained and seem to fail silently. Below is the output from the last completed trial:

The time is now 07/20/24 01:17:48. Training will be stopped early if not completed by 07/20/24 02:17:48.

Epoch 1/250
1000/1000 - 40s - 40ms/step - loss: 0.2205 - mae: 0.2205 - mse: 0.1057 - r2_score: -6.8866e-02 - val_loss: 0.2168 - val_mae: 0.2168 - val_mse: 0.0997 - val_r2_score: -2.7535e-03
...
Epoch 13/250
1000/1000 - 38s - 38ms/step - loss: 0.2171 - mae: 0.2171 - mse: 0.1003 - r2_score: -1.2107e-04 - val_loss: 0.2166 - val_mae: 0.2166 - val_mse: 0.0996 - val_r2_score: 0.0011
Epoch 13: early stopping
Restoring model weights from the end of the best epoch: 6.

Evaluating model against 20000 test dataset instances.
1000/1000 - 4s - 4ms/step - loss: 0.2180 - mae: 0.2180 - mse: 0.1157 - r2_score: -2.3110e-01

Full evaluation against 30 formal-test dataset instances.
30/30 - 0s - 8ms/step - loss: 0.1759 - mae: 0.1759 - mse: 0.0672 - r2_score: -2.5328e-01

--------------------------------------------------------------------------------
Trial result: MAE = 0.217980 and MSE = 0.115703

count(trainable weights) = 625,286 yielding params(ln[weights]) = 13.345964 and:
              weighted loss(MSE*params) = 1.544166
              AIC = -8,807.275
              BIC = -8,722.958
--------------------------------------------------------------------------------

================================================================================
 [26/500] Best Trial: #20, status=ok, loss=0.055058, MAE=0.055058, MSE=0.023258 
================================================================================

and the following is from the subsequent trial (this pattern repeats until the requested number of evals is completed):

The time is now 07/20/24 01:26:09. Training will be stopped early if not completed by 07/20/24 02:26:09.

Epoch 1/250

================================================================================
 [27/500] Best Trial: #20, status=ok, loss=0.055058, MAE=0.055058, MSE=0.023258 
================================================================================
SteveOv commented 1 month ago

A possibility is the tf/keras datasets, which don't always appear to release resources in a timely fashion. The training/validation datasets are currently recreated for each eval, which is probably unnecessary so long as the tf.random behaviour is reset for each eval (which it is). I've reworked model_search to set up these datasets once, alongside the testing dataset.
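
A rough sketch of that rework; make_dataset and build_and_train are hypothetical stand-ins for the project's pipeline, and the real model_search code differs. The point is that the datasets are built once outside the per-eval objective while the random state is still reset for every eval:

```python
import tensorflow as tf
from hyperopt import STATUS_OK

# Hypothetical helpers standing in for the real dataset/training pipeline.
train_ds = make_dataset("training")
val_ds = make_dataset("validation")
test_ds = make_dataset("testing")

def objective(hyperparams):
    # Reset the random state at the start of every eval so trials stay
    # reproducible even though the datasets are now shared between them.
    tf.keras.utils.set_random_seed(42)
    model = build_and_train(hyperparams, train_ds, val_ds)  # hypothetical
    results = model.evaluate(test_ds, verbose=2, return_dict=True)
    return {"loss": results["loss"], "status": STATUS_OK}
```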

SteveOv commented 1 month ago

Nope; doesn't fix it. Reverting changes.

SteveOv commented 1 month ago

This seems to have been resolved by a combination of the work on #70 and the recent commits leading up to and including f3ba789, specifically moving from except OpError as exc: to except Exception as exc:.

I think what was happening was that non-OpError exceptions were being thrown and consumed by fmin. The commit af42f93 resolves a possible source involving the inability to reuse an optimizer (which only became visible with the changes outlined above).
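
For illustration, a sketch of the broadened handler inside a hypothetical hyperopt objective (train_and_evaluate is a made-up helper, not the repo's code); the change is simply catching Exception rather than only tf.errors.OpError, so a failing trial is reported to hyperopt explicitly instead of being consumed silently:

```python
from hyperopt import STATUS_FAIL, STATUS_OK

def objective(hyperparams):
    """Hypothetical hyperopt objective showing the broadened exception handling."""
    try:
        loss = train_and_evaluate(hyperparams)      # hypothetical helper
        return {"loss": loss, "status": STATUS_OK}
    except Exception as exc:  # was: except tf.errors.OpError as exc:
        # Surface the failure and mark the trial as failed rather than
        # letting the exception be swallowed and the trial fail silently.
        print(f"Trial failed: {exc}")
        return {"status": STATUS_FAIL}
```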