Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

Model Weights Saved in 0.5.3 may be incompatible with 0.6.0 #2825

Closed ahatamiz closed 3 years ago

ahatamiz commented 3 years ago

Describe the bug It seems that models trained and saved in MONAI 0.5.3 cannot be restored and used in 0.6.0. There is a mismatch between the keys of several layers.

To Reproduce Clone the UNETR repository from https://github.com/Project-MONAI/research-contributions/tree/master/UNETR/BTCV. Train a model in 0.5.3 and save the weights. Load the model in 0.6.0 using:

    ckpt = torch.load(args.ckpt)
    model.load_state_dict(ckpt['state_dict'])

Expected behavior MONAI 0.6.0 throws an error for mismatched keys when loading the state_dict for the convolution layers of the network. For example, the following is a representative error:

RuntimeError: Error(s) in loading state_dict for UNETR:
        Unexpected key(s) in state_dict: "encoder1.layer.norm1.weight", "encoder1.layer.norm1.bias", "encoder1.layer.norm2.weight", "encoder1.layer.norm2.bias", "encoder1.layer.norm3.weight", "encoder1.layer.norm3.bias", "encoder10.layer.norm1.weight", "encoder10.layer.norm1.bias", "encoder10.layer.norm2.weight", "encoder10.layer.norm2.bias", "encoder10.layer.norm3.weight", "encoder10.layer.norm3.bias", "decoder5.conv_block.norm1.weight", "decoder5.conv_block.norm1.bias", "decoder5.conv_block.norm2.weight", "decoder5.conv_block.norm2.bias", "decoder5.conv_block.norm3.weight", "decoder5.conv_block.norm3.bias", "decoder4.conv_block.norm1.weight", "decoder4.conv_block.norm1.bias", "decoder4.conv_block.norm2.weight", "decoder4.conv_block.norm2.bias", "decoder4.conv_block.norm3.weight", "decoder4.conv_block.norm3.bias", "decoder3.conv_block.norm1.weight", "decoder3.conv_block.norm1.bias", "decoder3.conv_block.norm2.weight", "decoder3.conv_block.norm2.bias", "decoder3.conv_block.norm3.weight", "decoder3.conv_block.norm3.bias", "decoder2.conv_block.norm1.weight", "decoder2.conv_block.norm1.bias", "decoder2.conv_block.norm2.weight", "decoder2.conv_block.norm2.bias", "decoder2.conv_block.norm3.weight", "decoder2.conv_block.norm3.bias", "decoder1.conv_block.norm1.weight", "decoder1.conv_block.norm1.bias", "decoder1.conv_block.norm2.weight", "decoder1.conv_block.norm2.bias", "decoder1.conv_block.norm3.weight", "decoder1.conv_block.norm3.bias".

These keys belong to the convolutional layers of the network. In 0.5.3, you can simply inspect, for example, model.decoder1.conv_block.norm3.bias and get its values (even without loading the state_dict):

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       requires_grad=True)

But in 0.6.0, model.decoder1.conv_block.norm3.bias returns None.

If this is not fixed, it breaks compatibility between 0.5.3 and 0.6.0 for restoring models.

yiheng-wang-nv commented 3 years ago

Hi @ahatamiz , in https://github.com/Project-MONAI/MONAI/pull/2166, UnetBasicBlock was changed to use monai.networks.layers.utils.get_norm_layer, which is more generic and makes the normalization layers adjustable.

The old version (0.5.3) used affine=True for the InstanceNorm3d layer, following the original implementation of nnUNet. With this setting, the layers have weight and bias parameters.

But after the change in https://github.com/Project-MONAI/MONAI/pull/2166, if you still need affine=True, you have to configure it explicitly by setting: norm_name=("instance", {"affine": True})
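For example, a minimal sketch of constructing the block with the old parameter layout restored (the channel and kernel values below are placeholders for illustration, not from the issue):

    from monai.networks.blocks.dynunet_block import UnetBasicBlock

    # explicitly request affine instance norm so the block keeps norm weight/bias,
    # matching checkpoints saved with MONAI 0.5.3
    block = UnetBasicBlock(
        spatial_dims=3,
        in_channels=16,    # placeholder value
        out_channels=32,   # placeholder value
        kernel_size=3,
        stride=1,
        norm_name=("instance", {"affine": True}),
    )
    # with affine=True the norm layers expose weight/bias, so the 0.5.3 keys exist
    print([k for k in block.state_dict() if "norm" in k])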

You can also refer to the changes in: https://github.com/Project-MONAI/MONAI/pull/2166/files#diff-0095d268268e7ee415e079c1f0ccabed72f2c722f92cb0780d2ba8133a088334R111. UnetResBlock and UnetBasicBlock were originally designed for DynUNet, and to ensure consistency after the changes, the default value of norm_name has been changed to: norm_name: Union[Tuple, str] = ("INSTANCE", {"affine": True}).

ahatamiz commented 3 years ago

Hi @yiheng-wang-nv , yes, the solution you mentioned works. But it's a temporary workaround that few people may be aware of. If you train your DynUnet model in 0.5.3 and want to restore the weights in 0.6.0, it does not work with the previous network definition.

I think, as a first step, this information needs to be reported somewhere (maybe a page on how to migrate to 0.6.0). Eventually, this needs to be fixed in later releases. The user needs to be able to use the instance norm with the same behavior as in older versions without needing to specify affine=True.

This applies not only to the DynUnet network but also to UnetResBlock and UnetBasicBlock. These blocks are also useful for other custom network implementations.

yiheng-wang-nv commented 3 years ago

Hi @ahatamiz ,

> If you train your DynUnet model in 0.5.3 and want to restore the weights in 0.6.0, it does not work

This is not correct, since the layers are not changed in the DynUNet class.
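As an illustration of why DynUNet itself is unaffected, a small sketch (the network arguments below are placeholders, not from the issue):

    from monai.networks.nets import DynUNet

    # DynUNet's default norm_name is ("INSTANCE", {"affine": True}),
    # so its instance norm layers still carry weight/bias parameters
    net = DynUNet(
        spatial_dims=3,
        in_channels=1,    # placeholder value
        out_channels=2,   # placeholder value
        kernel_size=[3, 3, 3],
        strides=[1, 2, 2],
        upsample_kernel_size=[2, 2],
    )
    # norm weight/bias keys are present, so 0.5.3 DynUNet checkpoints still match
    print([k for k in net.state_dict() if "norm" in k][:4])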

For UnetResBlock and UnetBasicBlock, the purpose of #2166 is to make norm_name more generic, so that users can configure the norm layers more flexibly. I've updated the docstring in #2826, but I'm not sure how to ensure that:

> The user needs to be able to use the instance norm with the same behavior as in older versions without needing to specify affine=True

Hi @Nic-Ma @wyli, what do you think?

ahatamiz commented 3 years ago

Hi @yiheng-wang-nv , you set norm_name=("instance", {"affine": True}) as the default in DynUnet and that's the reason it seems to work, but the problem persists with UnetResBlock and UnetBasicBlock.

I see the argument for making norm_name generic, but that comes at the cost of a breaking change where previous models cannot be restored. If you use UnetResBlock, UnetBasicBlock or any conv layer with instance normalization in 0.5.3 and try to restore the model in 0.6.0, it will not work, unless you somehow know to set norm_name=("instance", {"affine": True}). Why not make the conv layer behavior consistent?

wyli commented 3 years ago

I think this ticket makes a good point, @yiheng-wang-nv, perhaps we can update the migration guide to make it clear? As another workaround, is it possible to use copy_model_state to load v0.5.3 checkpoints? https://github.com/Project-MONAI/MONAI/blob/e95500ad1802e91da12f31350ba8e3b85dc36d11/monai/networks/utils.py#L339
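A minimal sketch of that workaround, assuming copy_model_state copies only the matching keys and returns the keys it updated (the UNETR arguments and checkpoint path are placeholders for illustration, standing in for the network defined in the research-contributions repo):

    import torch
    from monai.networks.nets import UNETR
    from monai.networks.utils import copy_model_state

    # placeholder model config and checkpoint path
    model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96))
    ckpt = torch.load("unetr_v0.5.3.pt", map_location="cpu")

    # copy whatever keys still match; mismatched keys are reported instead of raising
    model_dict, updated_keys, unchanged_keys = copy_model_state(model, ckpt["state_dict"])
    print(f"copied {len(updated_keys)} keys, skipped {len(unchanged_keys)} keys")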

In general, MONAI adopts semantic versioning: at this stage we are at major version zero (0.y.z), indicating initial development. Anything may change at any time, and the public API should not be considered stable.

ahatamiz commented 3 years ago

> I think this ticket makes a good point, @yiheng-wang-nv, perhaps we can update the migration guide to make it clear? As another workaround, is it possible to use copy_model_state to load v0.5.3 checkpoints?
>
> https://github.com/Project-MONAI/MONAI/blob/e95500ad1802e91da12f31350ba8e3b85dc36d11/monai/networks/utils.py#L339
>
> In general, MONAI adopts semantic versioning: at this stage we are at major version zero (0.y.z), indicating initial development. Anything may change at any time, and the public API should not be considered stable.

Hi @wyli , thank you. I think updating the migration guide would be really beneficial.

yiheng-wang-nv commented 3 years ago

> Hi @yiheng-wang-nv , you set norm_name=("instance", {"affine": True}) as the default in DynUnet and that's the reason it seems to work, but the problem persists with UnetResBlock and UnetBasicBlock.
>
> I see the argument for making norm_name generic, but that comes at the cost of a breaking change where previous models cannot be restored. If you use UnetResBlock, UnetBasicBlock or any conv layer with instance normalization in 0.5.3 and try to restore the model in 0.6.0, it will not work, unless you somehow know to set norm_name=("instance", {"affine": True}). Why not make the conv layer behavior consistent?

> or any conv layer with instance normalization in 0.5.3

This is incorrect. Only blocks in dynunet_block.py are affected, due to the limitation in this line

yiheng-wang-nv commented 3 years ago


Thanks @wyli, I've updated the guide.