finncatling / lap-risk

Uncertainty-aware mortality risk modelling in emergency laparotomy, using data from the NELA.

BDAU environment restarting during 08_train_eval_novel_model.py #93

Closed: finncatling closed this issue 3 years ago

finncatling commented 3 years ago

This has happened twice now.

finncatling commented 3 years ago

I have emailed the BDAU to see if our script is causing any issues. I'll also write some temporary code to save the NovelModel instance after every split of model training.

finncatling commented 3 years ago

Now running in the BDAU using nohup, with the script output logged to outputs/7_and_8_testing.log.

finncatling commented 3 years ago

Script crashed during training on train fold 103/120. It's not clear what caused the crash, but it doesn't look like a Python error (there was no error message in the nohup log). Luckily, we had modified 08_train_eval_novel_model.py to save the NovelModel after each split iteration by externalising the training loop usually implemented in NovelModel.fit():

# Train on each train/test split in turn, checkpointing the partially-trained
# NovelModel after every split so that progress isn't lost if the BDAU
# environment restarts again
for split_i in pb(
    range(novel_model.cat_imputer.tts.n_splits),
    prefix="Split iteration"
):
    novel_model._single_train_test_split(split_i)
    save_object(
        novel_model,
        os.path.join(NOVEL_MODEL_OUTPUT_DIR, "08_novel_model.pkl"))

so it was easy to modify the script to resume training as follows:

# Reload the most recent checkpointed NovelModel from disk...
novel_model: NovelModel = load_object(
    os.path.join(NOVEL_MODEL_OUTPUT_DIR, "08_novel_model.pkl"))

# ...then resume the training loop from the appropriate split, again
# checkpointing after every split iteration
for split_i in pb(
    range(len(novel_model.models) - 2, novel_model.cat_imputer.tts.n_splits),
    prefix="Split iteration"
):
    novel_model._single_train_test_split(split_i)
    save_object(
        novel_model,
        os.path.join(NOVEL_MODEL_OUTPUT_DIR, "08_novel_model.pkl"))

...

This completed the training without issue.
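
For reference, save_object and load_object above are the repo's own serialisation helpers; their implementation isn't shown in this thread. A minimal sketch of helpers with this interface, assuming they are thin wrappers around the standard pickle module:

import pickle


def save_object(obj, path: str) -> None:
    # Serialise obj to disk so it can be reloaded later, e.g. to resume
    # model training after an environment restart
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path: str):
    # Reload a previously-saved object from disk
    with open(path, "rb") as f:
        return pickle.load(f)

With helpers along these lines, the NovelModel pickled after each split iteration can be reloaded and training resumed from wherever the previous run got to.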