Closed OrianeN closed 11 months ago
It seems that the `CustomCheckpointIO` instance `custom_ckpt` was not used. The PyTorch Lightning documentation suggests passing it to the `Trainer` constructor:
```python
custom_ckpt = CustomCheckpointIO()
[...]
trainer = pl.Trainer(
    [...]
    callbacks=[
        lr_callback,
        grad_norm_callback,
        checkpoint_callback,
        GradientAccumulationScheduler({0: config.accumulate_grad_batches}),
    ],
    plugins=[custom_ckpt],
)
```
Adding that to train.py changed the output slightly: the prints from the custom loading function now appear:
```
[...]
Restoring states from the checkpoint path at /nas-labs/OCR/experiments/nougat-exp/nougat/models/0.1.0-base/pytorch_model.bin
path: /.../nougat/models/0.1.0-base/pytorch_model.bin False
Custom loaded ckpt, adding pl version
Traceback (most recent call last):
  File "/.../nougat/train.py", line 239, in <module>
    train(config)
  File "/.../nougat/train.py", line 209, in train
    trainer.fit(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 946, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 399, in _restore_modules_and_callbacks
    self.resume_start(checkpoint_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 84, in resume_start
    self._loaded_checkpoint = _pl_migrate_checkpoint(loaded_checkpoint, checkpoint_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/migration/utils.py", line 142, in _pl_migrate_checkpoint
    old_version = _get_version(checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/migration/utils.py", line 163, in _get_version
    return checkpoint["pytorch-lightning_version"]
KeyError: 'pytorch-lightning_version'
```
Update: specifying the version with `ckpt["pytorch-lightning_version"] = pl.__version__` in `CustomCheckpointIO.load_checkpoint()` worked to load the model, yet afterwards I got this error:

```
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
```
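That second error fits a weights-only file: a checkpoint that can resume training also carries optimizer and loop state, not just the model. A quick sanity check over the loaded dict can tell the two apart — a sketch only; the key list reflects what full Lightning checkpoints typically contain, not an official API:

```python
RESUME_KEYS = ("state_dict", "optimizer_states", "lr_schedulers", "epoch", "global_step")


def missing_resume_keys(ckpt: dict) -> list:
    # A weights-only checkpoint usually holds just the model's state_dict;
    # resuming an interrupted run additionally needs optimizer, scheduler,
    # and loop-progress state.
    return [k for k in RESUME_KEYS if k not in ckpt]


print(missing_resume_keys({"state_dict": {}}))
# → ['optimizer_states', 'lr_schedulers', 'epoch', 'global_step']
```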
I guess the released model isn't meant to be fine-tuned, so I'll close this issue.
I'm trying to run a dummy fine-tuning with a dataset I've created using only your arXiv paper as a sample (train = val, this is just for trying out the training pipeline). I've created a config YAML file that looks like:
Then I've launched `python3 nougat/train.py --config train_nougat.yaml --debug`, but I got a `KeyError: 'pytorch-lightning_version'`.

I have pytorch-lightning version 2.0.9.post0, yet I've tried other versions as well (e.g. 2.0.0) and still got the same issue. Nougat-ocr version: 0.1.17.

I've tried to add a version in the state_dict of the checkpoint, following this PyTorch discussion, but I still get the same error. Here's the modified function in train.py: