catalyst-team / catalyst

Accelerated deep learning R&D
https://catalyst-team.com
Apache License 2.0

resume/autoresume doesn't work #904

Closed: otherman16 closed this issue 4 years ago

otherman16 commented 4 years ago

🐛 Bug Report

I am trying to resume training from the last epoch with catalyst-dl run --autoresume last, for example from the 60th epoch of 120. In a previous version this worked fine, but now Catalyst loads only the best checkpoint and starts training from the beginning.

=> Loading checkpoint /.../checkpoints/best.pth
loaded model checkpoint /.../checkpoints/best.pth
1/120 * Epoch (train):   0% 0/411 [00:00<?, ?it/s]
...

catalyst-dl run --resume=/path/to/checkpoint.pth doesn't work either.
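For context, a framework-free sketch of what "full" resume semantics are expected to do (this is a simplification, not Catalyst's actual API: the checkpoint here is a plain dict with hypothetical "model", "optimizer", and "epoch" entries, mirroring what a full PyTorch-style checkpoint carries):

```python
# Minimal sketch of resume semantics, assuming a checkpoint is a dict
# holding model weights, optimizer state, and the epoch counter.
def save_checkpoint(model_state, optimizer_state, epoch):
    return {"model": model_state, "optimizer": optimizer_state, "epoch": epoch}

def resume(checkpoint, load_full=True):
    """Return the epoch to start training from.

    load_full=True restores the epoch counter (and optimizer state),
    so training continues where it left off; load_full=False restores
    weights only, so training restarts from epoch 1 -- which matches
    the behaviour reported above.
    """
    if load_full:
        return checkpoint["epoch"] + 1
    return 1

ckpt = save_checkpoint({"w": [0.1]}, {"lr": 0.01}, epoch=60)
assert resume(ckpt, load_full=True) == 61   # continue from epoch 61
assert resume(ckpt, load_full=False) == 1   # weights-only load: restart
```

The point of the sketch: loading best.pth without the epoch/optimizer state is indistinguishable from starting a fresh run, which is exactly the symptom in the log above.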

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
TensorFlow version: 1.15.0
TensorBoard version: 2.2.1

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: No

Versions of relevant libraries:
[pip3] alchemy-catalyst==20.3
[pip3] catalyst==20.7
[pip3] efficientnet-pytorch==0.6.3
[pip3] numpy==1.19.1
[pip3] tensorboard==2.2.1
[pip3] tensorboard-plugin-wit==1.7.0
[pip3] tensorboardX==2.1
[pip3] tensorflow==1.15.0
[pip3] tensorflow-estimator==1.15.1
[pip3] torch==1.4.0
[pip3] torch2trt==0.1.0
[pip3] torchvision==0.5.0
[conda] alchemy-catalyst          20.3                     pypi_0    pypi
[conda] catalyst                  20.7                     pypi_0    pypi
[conda] cudatoolkit               10.1.243             h6bb024c_0    defaults
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2020.1                      217    defaults
[conda] numpy                     1.19.1           py37h8960a57_0    conda-forge
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] tensorboard               2.2.1                    pypi_0    pypi
[conda] tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
[conda] tensorboardx              2.1                      pypi_0    pypi
[conda] tensorflow                1.15.0                   pypi_0    pypi
[conda] tensorflow-estimator      1.15.1                   pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch
Scitator commented 4 years ago

Looks interesting, maybe @Ditwoo could help with it. Meanwhile, @otherman16, have you tried to investigate the issue yourself? Maybe you already found the solution)) Could you please write down the last version without this bug?

otherman16 commented 4 years ago

I have tried to investigate this bug. I've found:

  1. The flag catalyst.core.callbacks.CheckpointCallback.load_on_stage_start is always None.
  2. Catalyst always tries to load best.pth, even if autoresume has not been set, in catalyst.dl.experiment.config.ConfigExperiment._process_callbacks():
    ...
        for callback in callbacks.values():
            if isinstance(callback, CheckpointCallback):
                if callback.load_on_stage_start is None:
                    callback.load_on_stage_start = "best"
                if (
                    isinstance(callback.load_on_stage_start, dict)
                    and "model" not in callback.load_on_stage_start
                ):
                    callback.load_on_stage_start["model"] = "best"
    ...
  3. catalyst.core.callbacks.CheckpointCallback._load_runner loads best.pth with need_load_full == False, in catalyst.core.callbacks.CheckpointCallback.on_stage_start:
    ...
            if self.load_on_stage_start is not None and checkpoint_exists:
                self._load_runner(
                    runner,
                    mapping=self.load_on_stage_start,
                    load_full=need_load_full,
                )
    ...

    I can't understand why the catalyst.core.callbacks.CheckpointCallback.load_on_stage_start flag is NOT set.

In catalyst v20.04 autoresume works.
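The interaction described in points 1 and 2 can be condensed into a hypothetical, simplified version of the quoted _process_callbacks logic (the class and function below are stand-ins for illustration, not Catalyst's real implementation): because a None load_on_stage_start is unconditionally defaulted to "best", any resume intent expressed elsewhere is silently overridden.

```python
# Hypothetical simplification of the quoted _process_callbacks snippet.
class CheckpointCallback:
    def __init__(self, load_on_stage_start=None):
        self.load_on_stage_start = load_on_stage_start

def process_callbacks(callbacks, resume=None):
    for cb in callbacks:
        if isinstance(cb, CheckpointCallback):
            if cb.load_on_stage_start is None:
                # Quoted behaviour: default to "best" even when the user
                # asked to resume from a specific checkpoint. A fix would
                # skip this default (or prefer "last") when resume is set.
                cb.load_on_stage_start = "best"
    return callbacks

cb = CheckpointCallback()
process_callbacks([cb], resume="/path/to/checkpoint.pth")
assert cb.load_on_stage_start == "best"  # the resume argument has no effect
```

This matches the observed symptom: best.pth is loaded on stage start regardless of the resume/autoresume flags, and with load_full effectively False, training restarts from epoch 1.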

Scitator commented 4 years ago

Looks like it's fixed :)