catalyst-team / catalyst

Accelerated deep learning R&D
https://catalyst-team.com
Apache License 2.0

resume/autoresume doesn't work #904

Closed: otherman16 closed this issue 4 years ago

otherman16 commented 4 years ago

🐛 Bug Report

I am trying to resume training from the last epoch with catalyst-dl run --autoresume last, for example from the 60th epoch of 120. In a previous version this worked fine, but now Catalyst loads only the best checkpoint and starts training from the beginning.

=> Loading checkpoint /.../checkpoints/best.pth
loaded model checkpoint /.../checkpoints/best.pth
1/120 * Epoch (train):   0% 0/411 [00:00<?, ?it/s]
...

catalyst-dl run --resume=/path/to/checkpoint.pth doesn't work either.
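For context, a framework-free sketch of what "full" resume semantics are expected to do (this is a simplification, not Catalyst's actual API: the checkpoint here is a plain dict with hypothetical "model", "optimizer", and "epoch" entries, mirroring what a full PyTorch-style checkpoint carries):

```python
# Minimal sketch of resume semantics, assuming a checkpoint is a dict
# holding model weights, optimizer state, and the epoch counter.
def save_checkpoint(model_state, optimizer_state, epoch):
    return {"model": model_state, "optimizer": optimizer_state, "epoch": epoch}

def resume(checkpoint, load_full=True):
    """Return the epoch to start training from.

    load_full=True restores the epoch counter (and optimizer state),
    so training continues where it left off; load_full=False restores
    weights only, so training restarts from epoch 1 -- which matches
    the behaviour reported above.
    """
    if load_full:
        return checkpoint["epoch"] + 1
    return 1

ckpt = save_checkpoint({"w": [0.1]}, {"lr": 0.01}, epoch=60)
assert resume(ckpt, load_full=True) == 61   # continue from epoch 61
assert resume(ckpt, load_full=False) == 1   # weights-only load: restart
```

The point of the sketch: loading best.pth without the epoch/optimizer state is indistinguishable from starting a fresh run, which is exactly the symptom in the log above.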

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
TensorFlow version: 1.15.0
TensorBoard version: 2.2.1

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: No

Versions of relevant libraries:
[pip3] alchemy-catalyst==20.3
[pip3] catalyst==20.7
[pip3] efficientnet-pytorch==0.6.3
[pip3] numpy==1.19.1
[pip3] tensorboard==2.2.1
[pip3] tensorboard-plugin-wit==1.7.0
[pip3] tensorboardX==2.1
[pip3] tensorflow==1.15.0
[pip3] tensorflow-estimator==1.15.1
[pip3] torch==1.4.0
[pip3] torch2trt==0.1.0
[pip3] torchvision==0.5.0
[conda] alchemy-catalyst          20.3                     pypi_0    pypi
[conda] catalyst                  20.7                     pypi_0    pypi
[conda] cudatoolkit               10.1.243             h6bb024c_0    defaults
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2020.1                      217    defaults
[conda] numpy                     1.19.1           py37h8960a57_0    conda-forge
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] tensorboard               2.2.1                    pypi_0    pypi
[conda] tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
[conda] tensorboardx              2.1                      pypi_0    pypi
[conda] tensorflow                1.15.0                   pypi_0    pypi
[conda] tensorflow-estimator      1.15.1                   pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch
Scitator commented 4 years ago

Looks interesting, maybe @Ditwoo could help with it. Meanwhile, @otherman16, have you tried to investigate the issue yourself? Maybe you already found the solution)) Could you please write down the last version without this bug?

otherman16 commented 4 years ago

I have tried to investigate this bug. I've found:

  1. The flag catalyst.core.callbacks.CheckpointCallback.load_on_stage_start is always None.
  2. Catalyst always tries to load best.pth, even if autoresume has not been set, in catalyst.dl.experiment.config.ConfigExperiment._process_callbacks():
    ...
        for callback in callbacks.values():
            if isinstance(callback, CheckpointCallback):
                if callback.load_on_stage_start is None:
                    callback.load_on_stage_start = "best"
                if (
                    isinstance(callback.load_on_stage_start, dict)
                    and "model" not in callback.load_on_stage_start
                ):
                    callback.load_on_stage_start["model"] = "best"
    ...
  3. catalyst.core.callbacks.CheckpointCallback._load_runner loads best.pth with need_load_full == False, in catalyst.core.callbacks.CheckpointCallback.on_stage_start:
    ...
            if self.load_on_stage_start is not None and checkpoint_exists:
                self._load_runner(
                    runner,
                    mapping=self.load_on_stage_start,
                    load_full=need_load_full,
                )
    ...

    I can't understand why the catalyst.core.callbacks.CheckpointCallback.load_on_stage_start flag is NOT set.

In catalyst v20.04 autoresume works.
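The interaction described in points 1 and 2 can be condensed into a hypothetical, simplified version of the quoted _process_callbacks logic (the class and function below are stand-ins for illustration, not Catalyst's real implementation): because a None load_on_stage_start is unconditionally defaulted to "best", any resume intent expressed elsewhere is silently overridden.

```python
# Hypothetical simplification of the quoted _process_callbacks snippet.
class CheckpointCallback:
    def __init__(self, load_on_stage_start=None):
        self.load_on_stage_start = load_on_stage_start

def process_callbacks(callbacks, resume=None):
    for cb in callbacks:
        if isinstance(cb, CheckpointCallback):
            if cb.load_on_stage_start is None:
                # Quoted behaviour: default to "best" even when the user
                # asked to resume from a specific checkpoint. A fix would
                # skip this default (or prefer "last") when resume is set.
                cb.load_on_stage_start = "best"
    return callbacks

cb = CheckpointCallback()
process_callbacks([cb], resume="/path/to/checkpoint.pth")
assert cb.load_on_stage_start == "best"  # the resume argument has no effect
```

This matches the observed symptom: best.pth is loaded on stage start regardless of the resume/autoresume flags, and with load_full effectively False, training restarts from epoch 1.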

Scitator commented 4 years ago

Looks like it's fixed :)