cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.

Repeat individual tuning trials that terminate with errors? #22

Open jmmuncie opened 1 year ago

jmmuncie commented 1 year ago

Hi,

I had a similar issue to the one raised in #1: while tuning an accessibility topic model, I received a CUDA out-of-memory error (CUDA out of memory. Tried to allocate...), and the tuning process terminated.

When I view the summary of tuning progress via mira.topics.print_study(study_atac), I see this:

Trials finished: 50 | Best trial: 33 | Best score: 1.8355e-01
Press ctrl+C,ctrl+C or esc,I+I,I+I in Jupyter notebook to stop early.

#Topics | Trials (number is #folds tested)
      5 | 1 
      7 | 1 
     10 | E 1 1 
     11 | 1 
     13 | 5 5 2 
     14 | 1 1 
     15 | 1 
     16 | 1 
     17 | 5 2 5 
     18 | 5 5 
     19 | 1 1 
     20 | 5 5 1 
     21 | 1 
     22 | 4 5 
     23 | 5 5 5 1 
     24 | 1 1 
     25 | 5 5 1 
     27 | 5 1 0 
     28 | 1 
     30 | 1 
     31 | 5 5 
     33 | 1 
     35 | 1 
     37 | 1 
     38 | 1 
     40 | 2 
     46 | 1 
     50 | 5 
     54 | 1 

Trial Information:
Trial #0   | completed, score: 1.8360e-01 | params: {'batch_size': 32, 'beta': 0.9383, 'encoder_dropout': 0.0693, 'kl_strategy': 'cyclic', 'num_epochs': 31, 'num_layers': 3, 'num_topics': 18}
Trial #1   | completed, score: 1.8378e-01 | params: {'batch_size': 32, 'beta': 0.9161, 'encoder_dropout': 0.0305, 'kl_strategy': 'cyclic', 'num_epochs': 21, 'num_layers': 2, 'num_topics': 13}
Trial #2   | completed, score: 1.8391e-01 | params: {'batch_size': 64, 'beta': 0.9177, 'encoder_dropout': 0.0689, 'kl_strategy': 'monotonic', 'num_epochs': 25, 'num_layers': 3, 'num_topics': 13}
Trial #3   | completed, score: 1.8358e-01 | params: {'batch_size': 32, 'beta': 0.9108, 'encoder_dropout': 0.1350, 'kl_strategy': 'monotonic', 'num_epochs': 38, 'num_layers': 2, 'num_topics': 18}
Trial #4   | completed, score: 1.8366e-01 | params: {'batch_size': 64, 'beta': 0.9434, 'encoder_dropout': 0.1199, 'kl_strategy': 'monotonic', 'num_epochs': 36, 'num_layers': 2, 'num_topics': 17}
Trial #5   | completed, score: 1.8382e-01 | params: {'batch_size': 64, 'beta': 0.9492, 'encoder_dropout': 0.1039, 'kl_strategy': 'cyclic', 'num_epochs': 26, 'num_layers': 3, 'num_topics': 25}
Trial #6   | pruned at step: 2            | params: {'batch_size': 64, 'beta': 0.9743, 'encoder_dropout': 0.0222, 'kl_strategy': 'cyclic', 'num_epochs': 20, 'num_layers': 2, 'num_topics': 17}
Trial #7   | pruned at step: 2            | params: {'batch_size': 128, 'beta': 0.9191, 'encoder_dropout': 0.0416, 'kl_strategy': 'cyclic', 'num_epochs': 21, 'num_layers': 3, 'num_topics': 13}
Trial #8   | ERROR                        | params: {'batch_size': 128, 'beta': 0.9146864520093327, 'encoder_dropout': 0.021906189591768904, 'kl_strategy': 'monotonic', 'num_epochs': 20, 'num_layers': 3, 'num_topics': 10}
Trial #9   | pruned at step: 1            | params: {'batch_size': 128, 'beta': 0.9492, 'encoder_dropout': 0.0927, 'kl_strategy': 'cyclic', 'num_epochs': 20, 'num_layers': 3, 'num_topics': 14}
Trial #10  | completed, score: 1.8362e-01 | params: {'batch_size': 32, 'beta': 0.9040, 'encoder_dropout': 0.1483, 'kl_strategy': 'monotonic', 'num_epochs': 40, 'num_layers': 2, 'num_topics': 50}
Trial #11  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9339, 'encoder_dropout': 0.0698, 'kl_strategy': 'monotonic', 'num_epochs': 33, 'num_layers': 2, 'num_topics': 5}
Trial #12  | completed, score: 1.8365e-01 | params: {'batch_size': 32, 'beta': 0.9761, 'encoder_dropout': 0.1466, 'kl_strategy': 'monotonic', 'num_epochs': 31, 'num_layers': 3, 'num_topics': 31}
Trial #13  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9033, 'encoder_dropout': 0.0542, 'kl_strategy': 'monotonic', 'num_epochs': 38, 'num_layers': 3, 'num_topics': 27}
Trial #14  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9322, 'encoder_dropout': 0.1191, 'kl_strategy': 'cyclic', 'num_epochs': 30, 'num_layers': 2, 'num_topics': 7}
Trial #15  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9894, 'encoder_dropout': 0.0949, 'kl_strategy': 'monotonic', 'num_epochs': 35, 'num_layers': 2, 'num_topics': 10}
Trial #16  | pruned at step: 2            | params: {'batch_size': 32, 'beta': 0.9582, 'encoder_dropout': 0.1236, 'kl_strategy': 'cyclic', 'num_epochs': 28, 'num_layers': 3, 'num_topics': 40}
Trial #17  | pruned at step: 1            | params: {'batch_size': 128, 'beta': 0.9290, 'encoder_dropout': 0.0807, 'kl_strategy': 'cyclic', 'num_epochs': 33, 'num_layers': 2, 'num_topics': 21}
Trial #18  | completed, score: 1.8359e-01 | params: {'batch_size': 32, 'beta': 0.9035, 'encoder_dropout': 0.0461, 'kl_strategy': 'monotonic', 'num_epochs': 40, 'num_layers': 3, 'num_topics': 31}
Trial #19  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9013, 'encoder_dropout': 0.0538, 'kl_strategy': 'monotonic', 'num_epochs': 39, 'num_layers': 3, 'num_topics': 25}
Trial #20  | pruned at step: 1            | params: {'batch_size': 128, 'beta': 0.9114, 'encoder_dropout': 0.0568, 'kl_strategy': 'monotonic', 'num_epochs': 37, 'num_layers': 2, 'num_topics': 54}
Trial #21  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9265, 'encoder_dropout': 0.0725, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 20}
Trial #22  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9021, 'encoder_dropout': 0.0510, 'kl_strategy': 'monotonic', 'num_epochs': 37, 'num_layers': 3, 'num_topics': 27}
Trial #23  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9096, 'encoder_dropout': 0.0105, 'kl_strategy': 'monotonic', 'num_epochs': 39, 'num_layers': 3, 'num_topics': 38}
Trial #24  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9254, 'encoder_dropout': 0.0608, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 23}
Trial #25  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9001, 'encoder_dropout': 0.0359, 'kl_strategy': 'monotonic', 'num_epochs': 35, 'num_layers': 3, 'num_topics': 37}
Trial #26  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9231, 'encoder_dropout': 0.0820, 'kl_strategy': 'monotonic', 'num_epochs': 38, 'num_layers': 3, 'num_topics': 10}
Trial #27  | pruned at step: 4            | params: {'batch_size': 32, 'beta': 0.9108, 'encoder_dropout': 0.0615, 'kl_strategy': 'monotonic', 'num_epochs': 32, 'num_layers': 3, 'num_topics': 22}
Trial #28  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9216, 'encoder_dropout': 0.0880, 'kl_strategy': 'monotonic', 'num_epochs': 28, 'num_layers': 3, 'num_topics': 30}
Trial #29  | pruned at step: 1            | params: {'batch_size': 64, 'beta': 0.9072, 'encoder_dropout': 0.0535, 'kl_strategy': 'monotonic', 'num_epochs': 36, 'num_layers': 3, 'num_topics': 46}
Trial #30  | pruned at step: 1            | params: {'batch_size': 128, 'beta': 0.9400, 'encoder_dropout': 0.0698, 'kl_strategy': 'monotonic', 'num_epochs': 30, 'num_layers': 3, 'num_topics': 19}
Trial #31  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9132, 'encoder_dropout': 0.1349, 'kl_strategy': 'monotonic', 'num_epochs': 38, 'num_layers': 3, 'num_topics': 20}
Trial #32  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9249, 'encoder_dropout': 0.0650, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 25}
Trial #33  | completed, score: 1.8355e-01 | params: {'batch_size': 32, 'beta': 0.9277, 'encoder_dropout': 0.0316, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 23}
Trial #34  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9179, 'encoder_dropout': 0.0750, 'kl_strategy': 'monotonic', 'num_epochs': 40, 'num_layers': 3, 'num_topics': 28}
Trial #35  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9152, 'encoder_dropout': 0.0445, 'kl_strategy': 'monotonic', 'num_epochs': 23, 'num_layers': 3, 'num_topics': 15}
Trial #36  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9066, 'encoder_dropout': 0.0610, 'kl_strategy': 'monotonic', 'num_epochs': 36, 'num_layers': 3, 'num_topics': 11}
Trial #37  | pruned at step: 1            | params: {'batch_size': 64, 'beta': 0.9364, 'encoder_dropout': 0.0192, 'kl_strategy': 'monotonic', 'num_epochs': 32, 'num_layers': 3, 'num_topics': 24}
Trial #38  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9166, 'encoder_dropout': 0.1078, 'kl_strategy': 'monotonic', 'num_epochs': 37, 'num_layers': 3, 'num_topics': 19}
Trial #39  | pruned at step: 1            | params: {'batch_size': 64, 'beta': 0.9137, 'encoder_dropout': 0.1348, 'kl_strategy': 'monotonic', 'num_epochs': 26, 'num_layers': 3, 'num_topics': 16}
Trial #40  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9468, 'encoder_dropout': 0.1011, 'kl_strategy': 'monotonic', 'num_epochs': 29, 'num_layers': 3, 'num_topics': 20}
Trial #41  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9070, 'encoder_dropout': 0.0750, 'kl_strategy': 'monotonic', 'num_epochs': 38, 'num_layers': 3, 'num_topics': 35}
Trial #42  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9295, 'encoder_dropout': 0.0400, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 23}
Trial #43  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9257, 'encoder_dropout': 0.0291, 'kl_strategy': 'monotonic', 'num_epochs': 35, 'num_layers': 3, 'num_topics': 17}
Trial #44  | completed, score: 1.8356e-01 | params: {'batch_size': 32, 'beta': 0.9554, 'encoder_dropout': 0.0311, 'kl_strategy': 'monotonic', 'num_epochs': 34, 'num_layers': 3, 'num_topics': 22}
Trial #45  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9306, 'encoder_dropout': 0.0381, 'kl_strategy': 'cyclic', 'num_epochs': 32, 'num_layers': 3, 'num_topics': 24}
Trial #46  | pruned at step: 1            | params: {'batch_size': 128, 'beta': 0.9419, 'encoder_dropout': 0.0232, 'kl_strategy': 'monotonic', 'num_epochs': 39, 'num_layers': 3, 'num_topics': 33}
Trial #47  | pruned at step: 1            | params: {'batch_size': 32, 'beta': 0.9211, 'encoder_dropout': 0.0482, 'kl_strategy': 'monotonic', 'num_epochs': 33, 'num_layers': 3, 'num_topics': 23}
Trial #48  | pruned at step: 1            | params: {'batch_size': 64, 'beta': 0.9357, 'encoder_dropout': 0.0379, 'kl_strategy': 'monotonic', 'num_epochs': 31, 'num_layers': 3, 'num_topics': 14}

It seems I can simply run the tuning again and it will pick up where it left off and finish the trials that were not yet run, but it does not repeat the trial where the error occurred (Trial #8). Is there a way to specifically repeat that trial, to avoid having to rerun the entire tuning process?

AllenWLynch commented 1 year ago

Hello,

Does tuning not complete once a certain number of trials has been run? Failed trials are removed from the tuner's history and do not influence hyperparameter choices in subsequent trials. Since you've run a pretty comprehensive search, you can continue with the best of the models that did complete.

AL

jmmuncie commented 1 year ago

Hi Allen,

No, even after all trials have completed, the failed trials do not restart, and no additional trial is run in their place (i.e., if I request 64 trials and 2 fail with errors, MIRA reports that 64 trials have finished, but really only 62 have).

You're right that in my case most trials completed, so I can move forward fairly confident that the best (or nearly best) hyperparameters have been found. However, for the sake of being thorough, I was wondering whether there is a generalizable way to restart trials that fail with errors.

Best, Jon