Open JanFreise opened 2 years ago
That's a good point, and I don't think I explain it in the README in detail. The first step trains every configuration until `epoch_partial`, and the `n_max_config` best configurations are handed to the second step, where those models are trained until `epoch`. Once the second step has finished, we compute the loss on the validation set with each epoch's checkpoint. Only if the best epoch is `epoch`, meaning there is a possibility that the model is still under-trained, do we continue the fine-tuning.

For example, if you set `epoch=10` and the best model is the one at `epoch=8`, then that's the final model (no third round). If the best model is the one at `epoch=10`, then the third round starts and fine-tuning continues until the validation loss stops improving.
I enabled more variations for the grid search and the last step of training now crashes (I started this twice in a row to confirm). Before, it worked just fine when I set `n_max_config=1`, handing over one model to step 2 of the training.

Error message: `DistutilsFileError: could not create './logs/my_model/best_model/config.json': No such file or directory.`
```python
searcher = GridSearcher(
    checkpoint_dir=current_checkpoint,
    local_dataset=local_dataset,
    model="PretrainedLM",
    epoch=10,
    epoch_partial=3,
    n_max_config=3,
    batch_size=64,  # are the texts auto-chunked or is the rest of the sequence just being discarded?
    gradient_accumulation_steps=[4, 8],
    crf=[True, False],
    lr=[1e-4, 1e-5],
    weight_decay=[1e-7],
    random_seed=[42],
    lr_warmup_step_ratio=[0.1],
    max_grad_norm=[10],
    # settings from the earlier run that worked fine:
    # gradient_accumulation_steps=[4],
    # crf=[True],
    # lr=[1e-4, 1e-5],
    # weight_decay=[None],
    # random_seed=[42],
    # lr_warmup_step_ratio=[0.1],
    # max_grad_norm=[None],
    use_auth_token=True
)
searcher.train()
```
These are the last log entries:
```
2022-10-02 19:36:21 INFO tmp metric: 0.7093469910371318
2022-10-02 19:36:21 INFO finish 3rd phase (no improvement)
2022-10-02 19:36:21 INFO 3rd RUN RESULTS: ./logs/my_model/model_cghqta
2022-10-02 19:36:21 INFO epoch 10: 0.6972361809045226
2022-10-02 19:36:21 INFO epoch 11: 0.6990415335463258
2022-10-02 19:36:21 INFO epoch 12: 0.6998706338939199
2022-10-02 19:36:21 INFO epoch 13: 0.7048969072164948
2022-10-02 19:36:21 INFO epoch 14: 0.707613563659629
2022-10-02 19:36:21 INFO epoch 15: 0.708893154190659
2022-10-02 19:36:21 INFO epoch 16: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 17: 0.7103403982016699
2022-10-02 19:36:21 INFO epoch 18: 0.7093469910371318
```
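In case it helps narrow this down: a possible stopgap, purely my assumption based on the error path above rather than a confirmed fix, would be to create the `best_model` directory before calling `searcher.train()`, since the copy seems to fail because the destination directory does not exist yet.

```python
import os

# Assumed workaround, not a confirmed fix: pre-create the directory that the best
# checkpoint appears to be copied into (path taken from the error message above),
# then start the search as before.
os.makedirs("./logs/my_model/best_model", exist_ok=True)
searcher.train()
```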
Referring to the documentation: "The best model in the second stage will continue fine-tuning till the validation metric get decreased."
This raises the question of how training up to epoch "L" is handled. If the learning curve shows overfitting before the configured maximum number of epochs for step 2 (e.g. 10), does it recognize that and stop before epoch 10, or does it just continue and hand over an overfitting model for further training of "best_model"?