TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

problem with saving the model checkpoint #63

Closed: tonytu16 closed this issue 4 years ago

tonytu16 commented 4 years ago

Hello. When I tried to save a model to the designated path, I got a "checkpoint metric is not available" error. So I re-pulled the repo and tried training on the KITTI_tiny dataset. The model seems to train properly and I no longer get the "checkpoint metric is not available" error, but I don't see a checkpoint file being saved to the path I set on line 22 of default_config.py. Could you help me with this? Thank you very much!

[Screenshot: 2020-07-31, 11:35:50 PM]

[Screenshot: 2020-07-31, 11:52:21 PM]

VitorGuizilini-TRI commented 4 years ago

Can you share your .yaml file, or at least the checkpoint part? This error indicates that you are trying to monitor a metric that does not exist.
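
To illustrate the kind of check that can produce this error, here is a minimal sketch. It is not the repository's actual code: the function name `select_monitored_metric`, the metric values, and the `rmse_pp_gt` key are hypothetical and only stand in for whatever the validation step reports.

```python
from typing import Dict


def select_monitored_metric(metrics: Dict[str, float], monitor: str) -> float:
    """Return the monitored metric's value, or fail loudly if it is missing."""
    if monitor not in metrics:
        available = ", ".join(sorted(metrics)) or "none"
        raise KeyError(
            f"Checkpoint metric '{monitor}' is not available. "
            f"Available metrics: {available}"
        )
    return metrics[monitor]


# Hypothetical validation results for one epoch (values are made up).
validation_metrics = {"abs_rel_pp_gt": 0.12, "rmse_pp_gt": 4.9}

# Works: the monitored name matches a key the validation step reported.
best_candidate = select_monitored_metric(validation_metrics, "abs_rel_pp_gt")
```

In this sketch, asking for a name the validation step never reports raises the error instead of silently skipping the checkpoint.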

tonytu16 commented 4 years ago

Hello,

Thank you for your reply! I pulled the repo and ran a simple training test on the kitti_tiny dataset. I put the kitti_tiny folder in packnet-sfm/data/datasets and added a cfg.checkpoint.filepath in default_config.py for checkpoint saving. Those two are the only changes I made. Thank you!

[Screenshot: 2020-08-02, 3:20:33 PM]

[Screenshot: 2020-08-02, 3:22:49 PM]

soheilAppear commented 4 years ago

This saving issue happened to me too. There may be more than one reason behind it. One possibility is that your GPU memory is full. I'm not sure, but try rebooting the system, starting only the training process, and checking whether it saves checkpoints after several epochs.

soheilAppear commented 4 years ago

https://github.com/TRI-ML/packnet-sfm/issues/54

Check this issue too. You did not define the directory for your checkpoints. Follow the instructions in the link above.

VitorGuizilini-TRI commented 4 years ago

You are monitoring "loss", which is not a valid metric. Try something like "abs_rel_pp_gt"; it should work.
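
As a rough sketch of what that change might look like in the checkpoint section of the config: only `cfg.checkpoint.filepath` is named in this thread, so the `monitor` and `mode` field names are assumptions, the path is a placeholder, and `SimpleNamespace` merely stands in for the real config object so the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-in for the repository's config object (assumption, for illustration).
cfg = SimpleNamespace(checkpoint=SimpleNamespace())

# Placeholder directory where checkpoints should be written.
cfg.checkpoint.filepath = "/path/to/checkpoints"

# Monitor a validation metric that actually exists instead of "loss".
# The field name `monitor` is an assumption.
cfg.checkpoint.monitor = "abs_rel_pp_gt"

# Assumption: a lower absolute relative error is better, so minimize it.
cfg.checkpoint.mode = "min"
```

If you use a .yaml file, the same values would go in its checkpoint section, assuming the keys match.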