Cannot load checkpoint from Git-LFS

MarvinHarms commented 1 year ago

Hi :)

I am able to run train_turtlebot_lidar.py successfully, but when I try to run the evaluation script eval_turtlebot.py, I get the following error:

Traceback (most recent call last):
  File "/home/neural_clbf/evaluation/eval_turtlebot.py", line 44, in <module>
    plot_turtlebot()
  File "/home/neural_clbf/evaluation/eval_turtlebot.py", line 28, in plot_turtlebot
    neural_controller = NeuralCLBFController.load_from_checkpoint(
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 135, in load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/pytorch_lightning/utilities/cloud_io.py", line 33, in load
    return torch.load(f, map_location=map_location)
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

I tried to download the checkpoint file manually from Git LFS using this post, but the LFS quota seems to have reached its limit and I was not able to download the file.

As a temporary workaround, I tried to load the checkpoint created when running train_turtlebot_lidar.py but the operation fails on loading the checkpoint:

Traceback (most recent call last):
  File "/home/neural_clbf/evaluation/eval_turtlebot.py", line 45, in <module>
    eval_turtlebot()
  File "/home/neural_clbf/evaluation/eval_turtlebot.py", line 16, in eval_turtlebot
    neural_controller = NeuralCLBFController.load_from_checkpoint(
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 157, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/home/usr/miniconda3/envs/neural_clbf/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 199, in _load_model_state
    model = cls(**_cls_kwargs)
TypeError: __init__() missing 1 required positional argument: 'scenarios'

Is this expected (and I'm the idiot) or is this a bug?

dawsonc commented 1 year ago

This is a bug. Actually, it's laziness/cheapness on my part. We used LFS when we were actively training new examples (and it was just us using the code) so we could get away with the free LFS quota. I don't think LFS is the right choice for a public-facing project, since there would be too much demand on our quota, so I've switched the (now pretty stable) models over to regular git file storage and removed LFS. I'll update the documentation to reflect this, but could you please re-try cloning the repo (delete the folder and re-clone) and let me know if it works? Thanks!

stonkens commented 1 year ago

Hi @dawsonc, I'm running into the same issue. The saved models in the review folder are still in ckpt format and produce the same errors that @MarvinHarms reported: a pickle error when running train_linear_satellite.py directly, and a Git LFS storage-quota-exceeded error when trying to extract the Git LFS file.

dawsonc commented 1 year ago

@stonkens I think these are two different bugs (ckpt is independent of LFS/git; it's just the PyTorch Lightning save format). LFS should be disabled for this repository, and I can clone from scratch with no LFS problems on my laptop. Can you share your LFS error here and open a separate issue for the pickle error?

dawsonc commented 1 year ago

I was able to reproduce both the LFS bug and the pickle error. Turns out they WERE related after all :upside_down_face: When I migrated from LFS to normal files a while ago (to keep up with the number of people trying to download the models), it seems the migration wasn't fully complete, so the saved models were corrupted. I redid the migration, which fixed both issues, and there should be a PR coming shortly with the fix :)
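
For anyone who wants to check their own clone: the `invalid load key, 'v'` message is consistent with a checkpoint that is still a Git LFS pointer stub, i.e. a small text file whose first line starts with `version https://git-lfs.github.com/spec/v1`, rather than the real binary file. Below is a minimal sketch of such a check. The helper names `is_lfs_pointer` and `find_lfs_pointers` and the `*.ckpt` pattern are assumptions for illustration, not part of this repository.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line (Git LFS pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer stub."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root, pattern="*.ckpt"):
    """Return matching files under `root` that are still LFS pointer stubs."""
    return sorted(p for p in Path(root).rglob(pattern) if is_lfs_pointer(p))
```

Running `find_lfs_pointers(".")` from the repository root lists any checkpoints that were never replaced by their real contents. An empty list means the files are at least not pointer stubs, although it says nothing about other kinds of corruption.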

dawsonc commented 1 year ago

This should be resolved as of PR #16, but please re-open if the issue persists and I'll be happy to help.

MIT-REALM / neural_clbf

Cannot load checkpoint from Git-LFS #13