TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License
1.24k stars 243 forks

Where will my trained model be saved? #104

Closed zsteve2529 closed 3 years ago

zsteve2529 commented 3 years ago

Hi All and Authors of this great work,

I have spent 10 days training packnet-sfm on the KITTI dataset using a YAML file (not a checkpoint), and the training was successful. However, I cannot figure out where the train.py script saves the trained models.

Yes, I know it is a simple question and that I should read the code, which I did, but I am not very familiar with Python. So, could someone please point out to me where the trained model is saved?

VitorGuizilini-TRI commented 3 years ago

You need to set the checkpoint path in the .yaml file (there are some other options as well), otherwise the checkpoint won't be saved, unfortunately. Let me know if you have any other questions, so we can make sure your next model is properly saved!
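For reference, the relevant section looks something like this (a minimal sketch; the folder name is just an example):

```yaml
checkpoint:
    # If this section is missing or filepath is empty, no .ckpt file is written.
    filepath: '/data/experiments'
```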

zsteve2529 commented 3 years ago

@VitorGuizilini-TRI - Thank you very much for the prompt answer.

Perhaps my confusion is that I was under the assumption that training would produce a brand new checkpoint file?

Are users of packnet-sfm supposed to download the pre-trained models (or checkpoint files) and improve upon them? I was under the impression that I could produce my own brand new checkpoint file after training.

Here is my current YAML file. Please advise, and thank you so much. I do not have many GPUs, so my current training takes a long time, and I need to do it right this time. What I'd like to do is save the training results.

```yaml
checkpoint:
    # Folder where .ckpt files will be saved during training
    filepath: '../../data/'
model:
    name: 'SelfSupModel'
    optimizer:
        name: 'Adam'
        depth:
            lr: 0.0002
        pose:
            lr: 0.0002
    scheduler:
        name: 'StepLR'
        step_size: 30
        gamma: 0.5
    depth_net:
        name: 'PackNet01'
        version: '1A'
    pose_net:
        name: 'PoseNet'
        version: ''
    params:
        crop: 'garg'
        min_depth: 0.0
        max_depth: 80.0
datasets:
    augmentation:
        image_shape: (192, 640)
    train:
        batch_size: 4
        dataset: ['KITTI']
        path: ['../../data/datasets/KITTI_raw']
        split: ['data_splits/eigen_zhou_files.txt']
        depth_type: ['velodyne']
        repeat: [2]
    validation:
        dataset: ['KITTI']
        path: ['../../data/datasets/KITTI_raw']
        split: ['data_splits/eigen_val_files.txt',
                'data_splits/eigen_test_files.txt']
        depth_type: ['velodyne']
    test:
        dataset: ['KITTI']
        path: ['../../data/datasets/KITTI_raw']
        split: ['data_splits/eigen_test_files.txt']
        depth_type: ['velodyne']
```


VitorGuizilini-TRI commented 3 years ago

You can definitely train new models from scratch; when I say checkpoint path, I mean the path where the checkpoint will be saved. I agree that the naming might be a little confusing, and I will look into making this clearer in the future.

You seem to be doing it right: your checkpoint.filepath is set, so that's where new models will be saved. One thing you can try is setting an absolute path instead of a relative one. There are some other options you can use:

checkpoint:
    filepath: '/data/experiments' # Where the models will be saved
    monitor: 'abs_rel_pp_gt' # which metric is observed
    monitor_index: 0 # from which validation dataset the metric is observed
    mode: 'min' # if the metric is minimized or maximized
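To illustrate what the monitor/mode pair means, here is my own sketch (not packnet-sfm code) of the comparison the trainer effectively performs when deciding whether a new checkpoint beats the current best:

```python
def is_improvement(current: float, best: float, mode: str = "min") -> bool:
    """Return True if `current` beats `best` under the given mode.

    mode='min': lower is better (e.g. an error metric like abs_rel).
    mode='max': higher is better (e.g. an accuracy metric).
    """
    if mode == "min":
        return current < best
    return current > best
```

With `monitor: 'abs_rel_pp_gt'` and `mode: 'min'`, a checkpoint is kept whenever the monitored error drops below the best value seen so far.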

A good practice is to use KITTI_tiny first and run for one epoch, just to check that a model is saved. If that works properly, you can then start a full training session. I hope this works for you!
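After the one-epoch smoke run, you can verify that saving worked with a small helper like this (my own sketch, not part of packnet-sfm), which looks for .ckpt files under the directory given by `checkpoint.filepath`:

```python
from pathlib import Path


def find_checkpoints(filepath: str) -> list:
    """Return sorted paths of all .ckpt files anywhere under `filepath`."""
    return sorted(str(p) for p in Path(filepath).rglob("*.ckpt"))
```

If this returns an empty list after training, the checkpoint section of your YAML file is most likely not being picked up.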

manavsingh415 commented 3 years ago

Along with 'abs_rel_pp_gt', what other metrics can be used? What are all the possible strings we can give to checkpoint.monitor? Thanks