Unable to resume training

danishnazir commented 1 year ago

Bug

Hi, I am trying to resume training from a pretrained model (https://github.com/micmic123/QmapCompression), which is based on CompressAI. The pretrained model uses the Hyperprior architecture, with some additions.

To Reproduce

def load_checkpoint(path, model, optimizer=None, aux_optimizer=None, scaler=None, only_net=False):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    model.load_state_dict(snapshot['model'])

    if not only_net:
        if 'optimizer' in snapshot:
            optimizer.load_state_dict(snapshot['optimizer'])
        if 'aux_optimizer' in snapshot:
            aux_optimizer.load_state_dict(snapshot['aux_optimizer'])
        if scaler is not None and 'scaler' in snapshot:
            scaler.load_state_dict(snapshot['scaler'])

    return itr, model


RuntimeError: Error(s) in loading state_dict for CustomDataParallel:
        size mismatch for module.entropy_bottleneck._offset: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._quantized_cdf: copying a param with shape torch.Size([192, 45]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.entropy_bottleneck._cdf_length: copying a param with shape torch.Size([192]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._offset: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._quantized_cdf: copying a param with shape torch.Size([64, 3133]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional._cdf_length: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for module.gaussian_conditional.scale_table: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([0]).

Expected behavior

The model should load without errors.

Environment

Please copy and paste the output from python3 -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 470.141.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-gradcam==0.2.1
[pip3] pytorch-msssim==0.2.0
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.7.1+cu101
[pip3] torch-tb-profiler==0.4.0
[pip3] torchaudio==0.7.2
[pip3] torchvision==0.8.2+cu101
[conda] _pytorch_select           0.1                       cpu_0    anaconda
[conda] blas                      1.0                         mkl    anaconda
[conda] cudatoolkit               10.1.243             h6bb024c_0    anaconda
[conda] libmklml                  2019.0.5             h06a4308_0    anaconda
[conda] mkl                       2020.2                      256    anaconda
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch-gradcam           0.2.1                    pypi_0    pypi
[conda] pytorch-msssim            0.2.0                    pypi_0    pypi
[conda] pytorch-transformers      1.0.0                    pypi_0    pypi
[conda] torch                     1.7.1+cu101              pypi_0    pypi
[conda] torch-tb-profiler         0.4.0                    pypi_0    pypi
[conda] torchaudio                0.7.2                    pypi_0    pypi
[conda] torchvision               0.8.2+cu101              pypi_0    pypi

YodaEmbedding commented 1 year ago

What is the version and commit hash for your local compressAI repository?

danishnazir commented 1 year ago

Thanks for your response. Can you please tell me how I can find the commit hash for my local CompressAI repo? I installed CompressAI using pip install compressai. The library version is 1.2.2, but I am not sure where to find the commit hash.

YodaEmbedding commented 1 year ago

Can you please show us the output of:

COMPRESSAI_PATH="$(python -c 'import compressai; print(compressai.__path__[0])')"
echo "$COMPRESSAI_PATH"
cd "$COMPRESSAI_PATH"
git rev-parse HEAD

It sounds like you installed compressai from PyPI, so that means my recent commits https://github.com/InterDigitalInc/CompressAI/commit/b64b0daf0a62a6dc38eb8768fcada074ce19f6a8 and https://github.com/InterDigitalInc/CompressAI/commit/14ac02c5182cbfee596abdfea98886be6247479a are probably not the cause of the problem. The issue is that the module.entropy_bottleneck buffers are not being pre-allocated with enough space, since the loading code expects the key entropy_bottleneck directly, without the "module." prefix. Good news: the recent commits might actually fix the problem! Consider installing compressai from source instead:

cd ~
git clone https://github.com/InterDigitalInc/CompressAI compressai
cd compressai
pip install -U pip && pip install -e .

Alternatively, you can copy and paste the new load_state_dict function into CompressionModel, defined here:

https://github.com/InterDigitalInc/CompressAI/blob/14ac02c5182cbfee596abdfea98886be6247479a/compressai/models/base.py#L62-L142
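To illustrate what "pre-allocating" the buffers means, here is a minimal plain-PyTorch sketch. It is not CompressAI's actual load_state_dict; the class names EmptyBuffers and ResizingBuffers and the single _offset buffer are hypothetical stand-ins. It only shows that a buffer registered with shape [0] rejects a checkpoint of shape [192], and that resizing the buffer before loading avoids the size mismatch.

```python
import torch
import torch.nn as nn


class EmptyBuffers(nn.Module):
    """Hypothetical module whose buffer starts empty, like the shape [0] in the error."""

    def __init__(self):
        super().__init__()
        self.register_buffer("_offset", torch.zeros(0, dtype=torch.int32))


class ResizingBuffers(EmptyBuffers):
    """Same module, but buffers are resized to the checkpoint's shapes before loading."""

    def load_state_dict(self, state_dict, strict=True):
        for name, buf in list(self.named_buffers(recurse=False)):
            saved = state_dict.get(name)
            if saved is not None and saved.shape != buf.shape:
                # Re-register the buffer with the shape found in the checkpoint.
                self.register_buffer(name, torch.empty_like(saved))
        return super().load_state_dict(state_dict, strict=strict)


checkpoint = {"_offset": torch.arange(192, dtype=torch.int32)}

try:
    EmptyBuffers().load_state_dict(checkpoint)
    mismatch = False
except RuntimeError:
    mismatch = True  # size mismatch: [192] in checkpoint vs [0] in model

fixed = ResizingBuffers()
fixed.load_state_dict(checkpoint)
print(mismatch, tuple(fixed._offset.shape))
```

Note that the resize step looks buffers up by name, so it would presumably miss them if the checkpoint keys carried an extra prefix.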

danishnazir commented 1 year ago

Hi, thank you for your detailed answer. Yes, you are right, I am not building CompressAI from source. The requested output is as follows: COMPRESSAI_PATH = /anaconda/envs/azureml_py38/lib/python3.8/site-packages/compressai.

As for the proposed solution, my CompressionModel class already looks the same as the one you mentioned. I copied it earlier because there were some issues with multi-GPU training, and copying it worked for me. Please look at my project here: Entropy Models / Hyperprior Files. I think the issue might arise from using multiple versions at the same time. I use PyPI to install compressai, but I redefine some files (e.g. entropy_models.py) in my own code, and these might differ from the PyPI version. Could this be the problem?

YodaEmbedding commented 1 year ago

DataParallel adds a "module." prefix by default to every key in parallel_model.state_dict().
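A quick way to see the prefix is to compare the keys before and after wrapping. This is a small sketch using nn.Linear as a stand-in for the real model (an assumption for illustration; any nn.Module behaves the same way):

```python
import torch.nn as nn

# A tiny stand-in model; the real compression model is not needed to see the effect.
model = nn.Linear(4, 2)
wrapped = nn.DataParallel(model)

plain_keys = list(model.state_dict().keys())
wrapped_keys = list(wrapped.state_dict().keys())

print(plain_keys)    # ['weight', 'bias']
print(wrapped_keys)  # ['module.weight', 'module.bias']
```

A checkpoint saved from the wrapped model therefore cannot be loaded into the unwrapped one (or vice versa) without renaming the keys.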

Solutions:

1) Save the "non-parallel" model:

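import torch
from torch.nn import DataParallel
# Unwrap DataParallel so the saved keys carry no "module." prefix.
module = model.module if isinstance(model, DataParallel) else model
state_dict = module.state_dict()
torch.save(state_dict, "output.pth")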

2) Load checkpoint, rename all the keys, save new checkpoint:

import torch

ckpt = torch.load("input.pth")
print(ckpt.keys())
sd = "state_dict"  # I forgot what it was called.
print("\n".join(ckpt[sd].keys()))
# str.removeprefix needs Python 3.9+; on 3.8, slice the prefix off instead.
p = "module."
ckpt[sd] = {(k[len(p):] if k.startswith(p) else k): v for k, v in ckpt[sd].items()}
torch.save(ckpt, "output.pth")

3) Same as (2), but do it before loading the state_dict instead.

4) Load the model weights before wrapping it in DataParallel.

I would say (1) is the best and least likely to cause problems in the future, and maybe do (4) as well.
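For (3), a minimal sketch of the renaming step might look like the following. The helper name strip_prefix and the example key names are hypothetical, and plain integers stand in for tensors since only the keys matter here. It avoids str.removeprefix so that it also runs on Python 3.8.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that have it."""
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )


# Plain values stand in for tensors; the key names are made up for illustration.
saved = OrderedDict([("module.g_a.0.weight", 1), ("module.g_a.0.bias", 2)])
cleaned = strip_prefix(saved)
print(list(cleaned))  # ['g_a.0.weight', 'g_a.0.bias']
```

The result can then be passed to model.load_state_dict(...) on the unwrapped model, which combines naturally with (4): load first, wrap in DataParallel afterwards. Keys without the prefix are left untouched, so the helper is safe to apply to checkpoints saved either way.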

danishnazir commented 1 year ago

Yeah, you were right. Everything works now. I am attaching my code in case someone else faces a similar problem.

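import torch


def load_checkpoint(path, model):
    snapshot = torch.load(path)
    itr = snapshot['itr']
    print(f'Loaded from {itr} iterations')

    # Strip the "module." prefix that DataParallel added when the checkpoint was saved.
    prefix = "module."
    dict_ = {}
    for k, v in snapshot["model"].items():
        if k.startswith(prefix):
            k = k[len(prefix):]
        dict_[k] = v
    snapshot["model"] = dict_
    model.load_state_dict(snapshot['model'])
    # train.py unpacks both values, so return them.
    return itr, model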

and in train.py, we have

model = model.to(device)
optimizer, aux_optimizer = configure_optimizers(model, config)
if args.resume:
    itr, model = load_checkpoint(args.resume, model)
    logger.load_itr(itr)

if torch.cuda.device_count() > 1:
    model = CustomDataParallel(model)
