facebookresearch / silk

SiLK (Simple Learned Keypoint) is a self-supervised deep learning keypoint model.
GNU General Public License v3.0
643 stars · 58 forks

where is the trained model located, and how can i change this? #13

Closed facundolazcano closed 1 year ago

facundolazcano commented 1 year ago

Hi, first thanks for your great work and research.

I'm trying to train SiLK with a custom dataset, but I don't know where the checkpoints are saved by the training script.

And where in the code or configuration can I change this save path?

Yours truly

gleize commented 1 year ago

Hi @facundolazcano,

All training sessions are run in a timestamped folder specified here. This is a feature of the hydra library. So your saved checkpoints and tensorboard files should be situated under ./var/silk-cli/run/....
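For reference, Hydra's timestamped run directory is typically set with a block like the following in the main config file. This is a generic sketch of the common Hydra pattern; the exact path template used by SiLK may differ:

```yaml
hydra:
  run:
    # each training session writes into its own timestamped folder
    dir: ./var/silk-cli/run/${now:%Y-%m-%d}/${now:%H-%M-%S}
```

Overriding `hydra.run.dir` on the command line (e.g. `hydra.run.dir=./my-run`) is the usual way to redirect the output without editing the config.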

Additionally, the path of the saved checkpoint should be displayed in the logs at the end of the training session.

mhwang003 commented 1 year ago

Hi @gleize,

I have a question regarding how the trained models are saved. I ran the training code, and judging by the TensorBoard logs the training seems to have completed successfully, yet I got logs as shown in the image below.

  1. What might the version mean?
  2. It seems that not all of the models from each epoch were saved. Is there a reason for this, or is this an error on my side?

(screenshots attached: Screen Shot 2023-06-08 at 13 42 07, Screen Shot 2023-06-15 at 11 54 27)
gleize commented 1 year ago

Hi @mhwang003,

> 1. What might the version mean?

Since you've changed the output directory, we have a different structure. However, I believe those version_X folders are generated by PyTorch Lightning to avoid overwriting previously trained checkpoints. You can check that by looking at the file timestamps.

> 2. It seems that not all of the models from each epoch were saved. Is there a reason for this or is this an error on my side?

We do use a PyTorch Lightning callback to save the best 10 checkpoints (evaluated on the validation set). If you search for pytorch_lightning.callbacks.ModelCheckpoint in the codebase, you will find the config files where that option is specified (usually in etc/mode/train-xxx.yaml files). Once you find the config block, you can modify the save_top_k option to increase the number of checkpoints saved (e.g. here for SiLK).
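As a rough sketch, such a Hydra-instantiated callback block usually looks something like this (hypothetical keys and values; check the actual train-xxx.yaml in the repo for the real ones):

```yaml
# hypothetical sketch of a ModelCheckpoint entry in etc/mode/train-xxx.yaml
_target_: pytorch_lightning.callbacks.ModelCheckpoint
monitor: val.loss   # validation metric used to rank checkpoints (name is an assumption)
save_top_k: 10      # keep only the 10 best checkpoints
mode: min
```

Setting `save_top_k: -1` tells PyTorch Lightning to keep a checkpoint for every epoch instead of only the top 10.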

gleize commented 1 year ago

Answers added to FAQ. Closing now.