facundolazcano closed this issue 1 year ago
Hi @facundolazcano,
All training sessions are run in a timestamped folder specified here. This is a feature of the hydra library.
So your saved checkpoints and tensorboard files should be situated under ./var/silk-cli/run/...
Additionally, the path of the saved checkpoint should be displayed in the logs at the end of the training session.
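For a bit more context, here is a minimal sketch of how a hydra-based entry point ends up writing everything into a timestamped folder. This is not silk's actual code; the config path, config name, and script name are placeholders:

```python
# Minimal hydra sketch (placeholders, not silk's actual entry point).
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="etc", config_name="train")
def main(cfg: DictConfig) -> None:
    # Hydra switches the working directory to a timestamped run folder
    # (controlled by the hydra.run.dir config key), so anything written with a
    # relative path, such as checkpoints and TensorBoard event files, lands there.
    ...


if __name__ == "__main__":
    # The output folder can also be overridden per run from the command line, e.g.:
    #   python train.py hydra.run.dir=./my/output/dir
    main()
```

This is also where you would change the save path: point hydra.run.dir (in the config, or via a command-line override) to whatever directory you prefer.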
Hi @gleize,
I have a question regarding how the trained models are saved. I ran the training code, and judging from the TensorBoard logs the training seems to have completed successfully, yet I got the logs shown in the image below.
Hi @mhwang003,
- What might the version mean?
Since you've changed the output directory, we have a different structure. However, I believe those version_X folders are generated by PyTorch Lightning to avoid overwriting previously trained checkpoints. You can check that by looking at the file timestamps.
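As a quick illustration of where those folders typically come from, assuming PyTorch Lightning's default logger behavior (the directory names below are placeholders):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# With default settings, each new run under the same save_dir/name pair gets the
# next free version_N subfolder (version_0, version_1, ...), and checkpoints are
# saved under that version folder, so earlier runs are not overwritten.
logger = TensorBoardLogger(save_dir="outputs", name="silk")  # placeholder paths
trainer = Trainer(logger=logger)
```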
- It seems that not all of the models from each epoch were saved. Is there a reason for this or is this an error on my side?
We do use a PyTorch Lightning callback to save the best 10 checkpoints (evaluated on the validation set). If you search for pytorch_lightning.callbacks.ModelCheckpoint in the codebase, you will find the config files where that option is specified (usually in etc/mode/train-xxx.yaml files). Once you find the config block, you can modify the save_top_k option to increase the number of checkpoints saved (e.g. here for SiLK).
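For reference, the equivalent callback written directly in PyTorch Lightning looks roughly like this; in silk the same options are set through the YAML configs rather than in Python, and the monitored metric name below is only a placeholder:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the k best checkpoints according to a monitored validation metric.
# "val_loss" is a placeholder; the actual metric is defined in silk's configs.
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=10,  # increase this to keep more checkpoints per training run
)
trainer = Trainer(callbacks=[checkpoint_callback])
```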
Answers added to FAQ. Closing now.
Hi, first thanks for your great work and research.
I am trying to train SiLK with a custom dataset, but I don't know where the saved checkpoints are located when running the training script.
Also, where in the code or configuration can I change this save path?
Yours truly