NUS-HPC-AI-Lab / Neural-Network-Parameter-Diffusion

We introduce a novel approach for parameter generation, named neural network parameter diffusion (p-diff), which employs a standard latent diffusion model to synthesize a new set of parameters.

AssertionError #7

Closed: nightrain-vampire closed this issue 3 months ago

nightrain-vampire commented 4 months ago

I ran train_p_diff.py with the ConvNet-3 model and the 'ae_ddpm' system, with train_layer set to 'all'. However, it reports an AssertionError when running test_g_model:

Test the AE model
latent shape:torch.Size([10, 4, 492])
ae params shape:torch.Size([10, 2048])
307591 2048
Error executing job with overrides: ['task=dermamnist', 'system=ae_ddpm', 'mode=train']
Traceback (most recent call last):
  File "/data/user3/meddistillation/NNDiffusion/train_p_diff.py", line 10, in training_for_data
    result = train_generation(config)
  File "/data/user3/meddistillation/NNDiffusion/core/runner/runner.py", line 67, in train_generation
    trainer.fit(system, datamodule=datamodule, ckpt_path=cfg.load_system_checkpoint)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_train
    self._run_sanity_check()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1411, in _run_sanity_check
    val_loop.run()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    output = self._evaluation_step(**kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 222, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/data/user3/meddistillation/NNDiffusion/core/system/ae_ddpm.py", line 86, in validation_step
    acc, test_loss, output_list = self.task_func(param)
  File "/data/user3/meddistillation/NNDiffusion/core/system/base.py", line 47, in task_func
    return self.task.test_g_model(input)
  File "/data/user3/meddistillation/NNDiffusion/core/tasks/classification.py", line 44, in test_g_model
    assert (target_num == params_num)
AssertionError

It seems that target_num is 307591 but params_num is 2048. What is happening, and how can I fix this?

1zeryu commented 4 months ago

Would you provide more detail about the experiment? I'm more than willing to help you with that. The error means that target_num != params_num; the relevant code is the test_g_model function in core/tasks/classification.py. params_num is the number of generated parameters, and target_num is the number of parameters being replaced during validation.
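
Roughly, the check amounts to something like this (a minimal paraphrase, not the repo's exact code; check_counts and train_layer are just illustrative names):

import torch
import torch.nn as nn

def check_counts(generated_params: torch.Tensor, net: nn.Module, train_layer):
    # params_num: how many values the diffusion model actually generated
    params_num = generated_params.numel()
    # target_num: how many values the selected layers of the task model require
    target_num = sum(p.numel() for name, p in net.named_parameters()
                     if train_layer == 'all' or name in train_layer)
    assert target_num == params_num, f"target {target_num} != generated {params_num}"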

nightrain-vampire commented 4 months ago


Sure. The dataset samples are 28×28 with 3 channels. The ConvNet-3 I use is shown below:

import torch.nn as nn

# Swish() (used when net_act='swish') is defined elsewhere in my code.
class ConvNet(nn.Module):
    def __init__(self, channel, num_classes, net_width, net_depth, net_act, net_norm, net_pooling, im_size=(28, 28)):
        super(ConvNet, self).__init__()

        self.features, shape_feat = self._make_layers(channel, net_width, net_depth, net_norm, net_act, net_pooling, im_size)
        num_feat = shape_feat[0]*shape_feat[1]*shape_feat[2]
        self.classifier = nn.Linear(num_feat, num_classes)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def embed(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        return out

    def _get_activation(self, net_act):
        if net_act == 'sigmoid':
            return nn.Sigmoid()
        elif net_act == 'relu':
            return nn.ReLU(inplace=True)
        elif net_act == 'leakyrelu':
            return nn.LeakyReLU(negative_slope=0.01)
        elif net_act == 'swish':
            return Swish()
        else:
            exit('unknown activation function: %s'%net_act)

    def _get_pooling(self, net_pooling):
        if net_pooling == 'maxpooling':
            return nn.MaxPool2d(kernel_size=2, stride=2)
        elif net_pooling == 'avgpooling':
            return nn.AvgPool2d(kernel_size=2, stride=2)
        elif net_pooling == 'none':
            return None
        else:
            exit('unknown net_pooling: %s'%net_pooling)

    def _get_normlayer(self, net_norm, shape_feat):
        # shape_feat = [c, h, w]
        if net_norm == 'batchnorm':
            return nn.BatchNorm2d(shape_feat[0], affine=True)
        elif net_norm == 'layernorm':
            return nn.LayerNorm(shape_feat, elementwise_affine=True)
        elif net_norm == 'instancenorm':
            return nn.GroupNorm(shape_feat[0], shape_feat[0], affine=True)
        elif net_norm == 'groupnorm':
            return nn.GroupNorm(4, shape_feat[0], affine=True)
        elif net_norm == 'none':
            return None
        else:
            exit('unknown net_norm: %s'%net_norm)

    def _make_layers(self, channel, net_width, net_depth, net_norm, net_act, net_pooling, im_size):
        layers = []
        in_channels = channel
        # if im_size[0] == 28:
        #     im_size = (32, 32)
        shape_feat = [in_channels, im_size[0], im_size[1]]
        for d in range(net_depth):
            # layers += [nn.Conv2d(in_channels, net_width, kernel_size=3, padding=3 if channel == 1 and d == 0 else 1)]
            layers += [nn.Conv2d(in_channels, net_width, kernel_size=3, padding=1)]
            shape_feat[0] = net_width
            if net_norm != 'none':
                layers += [self._get_normlayer(net_norm, shape_feat)]
            layers += [self._get_activation(net_act)]
            in_channels = net_width
            if net_pooling != 'none':
                layers += [self._get_pooling(net_pooling)]
                shape_feat[1] //= 2
                shape_feat[2] //= 2

        return nn.Sequential(*layers), shape_feat

I set train_layer = 'all' since I am not sure which layers need to be fine-tuned. Now target_num is 307591, but params_num is still 2048. In fact, when I change ae_model.in_dim to 307591 in the config file, the program runs fine! Why? Can you help me?
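
For reference, the 307591 matches the full parameter count of this network; a minimal check, assuming the usual ConvNet-3 settings for DermaMNIST (net_width=128, net_depth=3, instance norm, average pooling, 7 classes):

net = ConvNet(channel=3, num_classes=7, net_width=128, net_depth=3,
              net_act='relu', net_norm='instancenorm',
              net_pooling='avgpooling', im_size=(28, 28))
# 3 conv blocks (conv + instance norm) plus the final linear layer on 128*3*3 features
print(sum(p.numel() for p in net.parameters()))  # 307591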

1zeryu commented 4 months ago

The bug occurs because the autoencoder model and the DDPM model are not adjusted to your parameter count. The backbone is set in configs/system/ae_ddpm:

ae_model:
  _target_: core.module.modules.encoder.medium
  in_dim: 2048
  input_noise_factor: 0.001
  latent_noise_factor: 0.5
model:
  arch:
    _target_: core.module.wrapper.ema.EMA
    model:
      _target_: core.module.modules.unet.AE_CNN_bottleneck
      in_channel: 1
      in_dim: 12

ae_model.in_dim must equal target_num, and (model.in_channel, model.in_dim) must match the latent shape.
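
Concretely, the fields to change look like this (a sketch; the latent values are placeholders you read off from the "latent shape" printed when the retrained autoencoder is validated):

ae_model:
  in_dim: 307591             # = target_num, the total number of replaced parameters
model:
  arch:
    model:
      in_channel: <latent C>  # must match the latent shape [B, C, D]
      in_dim: <latent D>      # produced by the retrained autoencoder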

1zeryu commented 4 months ago

Glad to be able to help. The in_dim is a hyperparameter of the autoencoder; the detailed code is in core/module/modules/encoder.py. As you know, we use 1D convolutional layers to extract features from the parameter vector, and in_dim is used to build the model, so if it is not set correctly the model does not fit the parameter count. In addition, you must set the correct (in_channel, in_dim) in the UNet model; otherwise you will hit a bug when training the UNet for diffusion.
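
To make that concrete, here is a minimal illustration (my own sketch, not the code in core/module/modules/encoder.py): a 1D conv encoder over a flattened parameter vector only accepts inputs of length in_dim, and whatever latent shape (C, D) it produces is what the UNet's (in_channel, in_dim) must match.

import torch
import torch.nn as nn

class TinyParamEncoder(nn.Module):
    def __init__(self, in_dim, latent_channels=4):
        super().__init__()
        self.in_dim = in_dim
        # one strided 1D conv standing in for the real stack of conv layers
        self.conv = nn.Conv1d(1, latent_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, flat_params):
        # flat_params: [batch, in_dim], one flattened parameter vector per row
        assert flat_params.shape[1] == self.in_dim, "parameter vector length mismatch"
        return self.conv(flat_params.unsqueeze(1))  # latent of shape [batch, C, D]

enc = TinyParamEncoder(in_dim=307591)
z = enc(torch.randn(2, 307591))
print(z.shape)  # torch.Size([2, 4, 153796]): this (C, D) is what the UNet must expect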