NUS-HPC-AI-Lab / Neural-Network-Parameter-Diffusion

We introduce a novel approach for parameter generation, named neural network parameter diffusion (p-diff), which employs a standard latent diffusion model to synthesize a new set of parameters.

AssertionError #7

Closed: nightrain-vampire closed this issue 3 months ago

nightrain-vampire commented 4 months ago

I ran train_p_diff.py with the ConvNet-3 model and the 'ae_ddpm' system, with train_layer set to 'all'. However, it reports an AssertionError when running test_g_model:

Test the AE model
latent shape:torch.Size([10, 4, 492])
ae params shape:torch.Size([10, 2048])
307591 2048
Error executing job with overrides: ['task=dermamnist', 'system=ae_ddpm', 'mode=train']
Traceback (most recent call last):
  File "/data/user3/meddistillation/NNDiffusion/train_p_diff.py", line 10, in training_for_data
    result = train_generation(config)
  File "/data/user3/meddistillation/NNDiffusion/core/runner/runner.py", line 67, in train_generation
    trainer.fit(system, datamodule=datamodule, ckpt_path=cfg.load_system_checkpoint)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_train
    self._run_sanity_check()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1411, in _run_sanity_check
    val_loop.run()
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 153, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 127, in advance
    output = self._evaluation_step(**kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 222, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/data/user3/miniconda3/envs/distill/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/data/user3/meddistillation/NNDiffusion/core/system/ae_ddpm.py", line 86, in validation_step
    acc, test_loss, output_list = self.task_func(param)
  File "/data/user3/meddistillation/NNDiffusion/core/system/base.py", line 47, in task_func
    return self.task.test_g_model(input)
  File "/data/user3/meddistillation/NNDiffusion/core/tasks/classification.py", line 44, in test_g_model
    assert (target_num == params_num)
AssertionError

It seems that target_num is 307591 but params_num is 2048. What is happening, and how can I fix this?

1zeryu commented 4 months ago

Would you provide more detail about the experiment? I'm more than willing to help you with that. The error means that target_num != params_num; the relevant code is the test_g_model function in core/tasks/classification.py. params_num is the number of generated parameters, and target_num is the number of parameters being replaced during validation.
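
Roughly, the check amounts to something like this (a minimal paraphrase, not the repo's exact code; check_counts and train_layer are just illustrative names):

import torch
import torch.nn as nn

def check_counts(generated_params: torch.Tensor, net: nn.Module, train_layer):
    # params_num: how many values the diffusion model actually generated
    params_num = generated_params.numel()
    # target_num: how many values the selected layers of the task model require
    target_num = sum(p.numel() for name, p in net.named_parameters()
                     if train_layer == 'all' or name in train_layer)
    assert target_num == params_num, f"target {target_num} != generated {params_num}"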

nightrain-vampire commented 4 months ago


Sure. The dataset samples are 28×28 with 3 channels. The ConvNet-3 I use is shown below:

import torch.nn as nn

# Swish() (used when net_act='swish') is defined elsewhere in my code.
class ConvNet(nn.Module):
    def __init__(self, channel, num_classes, net_width, net_depth, net_act, net_norm, net_pooling, im_size=(28, 28)):
        super(ConvNet, self).__init__()

        self.features, shape_feat = self._make_layers(channel, net_width, net_depth, net_norm, net_act, net_pooling, im_size)
        num_feat = shape_feat[0]*shape_feat[1]*shape_feat[2]
        self.classifier = nn.Linear(num_feat, num_classes)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def embed(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        return out

    def _get_activation(self, net_act):
        if net_act == 'sigmoid':
            return nn.Sigmoid()
        elif net_act == 'relu':
            return nn.ReLU(inplace=True)
        elif net_act == 'leakyrelu':
            return nn.LeakyReLU(negative_slope=0.01)
        elif net_act == 'swish':
            return Swish()
        else:
            exit('unknown activation function: %s'%net_act)

    def _get_pooling(self, net_pooling):
        if net_pooling == 'maxpooling':
            return nn.MaxPool2d(kernel_size=2, stride=2)
        elif net_pooling == 'avgpooling':
            return nn.AvgPool2d(kernel_size=2, stride=2)
        elif net_pooling == 'none':
            return None
        else:
            exit('unknown net_pooling: %s'%net_pooling)

    def _get_normlayer(self, net_norm, shape_feat):
        # shape_feat = [c, h, w]
        if net_norm == 'batchnorm':
            return nn.BatchNorm2d(shape_feat[0], affine=True)
        elif net_norm == 'layernorm':
            return nn.LayerNorm(shape_feat, elementwise_affine=True)
        elif net_norm == 'instancenorm':
            return nn.GroupNorm(shape_feat[0], shape_feat[0], affine=True)
        elif net_norm == 'groupnorm':
            return nn.GroupNorm(4, shape_feat[0], affine=True)
        elif net_norm == 'none':
            return None
        else:
            exit('unknown net_norm: %s'%net_norm)

    def _make_layers(self, channel, net_width, net_depth, net_norm, net_act, net_pooling, im_size):
        layers = []
        in_channels = channel
        # if im_size[0] == 28:
        #     im_size = (32, 32)
        shape_feat = [in_channels, im_size[0], im_size[1]]
        for d in range(net_depth):
            # layers += [nn.Conv2d(in_channels, net_width, kernel_size=3, padding=3 if channel == 1 and d == 0 else 1)]
            layers += [nn.Conv2d(in_channels, net_width, kernel_size=3, padding=1)]
            shape_feat[0] = net_width
            if net_norm != 'none':
                layers += [self._get_normlayer(net_norm, shape_feat)]
            layers += [self._get_activation(net_act)]
            in_channels = net_width
            if net_pooling != 'none':
                layers += [self._get_pooling(net_pooling)]
                shape_feat[1] //= 2
                shape_feat[2] //= 2

        return nn.Sequential(*layers), shape_feat

I set train_layer = 'all' since I am not sure which layers need to be fine-tuned. Now target_num is 307591, but params_num is still 2048. In fact, when I change ae_model.in_dim to 307591 in the config file, the program runs fine! Why? Can you help me?
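
For reference, the 307591 matches the full parameter count of this network; a minimal check, assuming the usual ConvNet-3 settings for DermaMNIST (net_width=128, net_depth=3, instance norm, average pooling, 7 classes):

net = ConvNet(channel=3, num_classes=7, net_width=128, net_depth=3,
              net_act='relu', net_norm='instancenorm',
              net_pooling='avgpooling', im_size=(28, 28))
# 3 conv blocks (conv + instance norm) plus the final linear layer on 128*3*3 features
print(sum(p.numel() for p in net.parameters()))  # 307591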

1zeryu commented 4 months ago

The bug occurs because the autoencoder model and the DDPM model are not adjusted to your parameter count. The backbone is set in configs/system/ae_ddpm:

ae_model:
  _target_: core.module.modules.encoder.medium
  in_dim: 2048
  input_noise_factor: 0.001
  latent_noise_factor: 0.5
model:
  arch:
    _target_: core.module.wrapper.ema.EMA
    model:
      _target_: core.module.modules.unet.AE_CNN_bottleneck
      in_channel: 1
      in_dim: 12

ae_model.in_dim must equal target_num, and (model.in_channel, model.in_dim) must match the latent shape.
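
Concretely, the fields to change look like this (a sketch; the latent values are placeholders you read off from the "latent shape" printed when the retrained autoencoder is validated):

ae_model:
  in_dim: 307591             # = target_num, the total number of replaced parameters
model:
  arch:
    model:
      in_channel: <latent C>  # must match the latent shape [B, C, D]
      in_dim: <latent D>      # produced by the retrained autoencoder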

1zeryu commented 4 months ago

Glad to be able to help. The in_dim is a hyperparameter of the autoencoder; the detailed code is in core/module/modules/encoder.py. As you know, we use 1D convolutional layers to extract features from the parameter vector, and in_dim is used to build the model, so if it is not set correctly the model does not fit the parameter count. In addition, you must set the correct (in_channel, in_dim) in the UNet model; otherwise you will hit a bug when training the UNet for diffusion.
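
To make that concrete, here is a minimal illustration (my own sketch, not the code in core/module/modules/encoder.py): a 1D conv encoder over a flattened parameter vector only accepts inputs of length in_dim, and whatever latent shape (C, D) it produces is what the UNet's (in_channel, in_dim) must match.

import torch
import torch.nn as nn

class TinyParamEncoder(nn.Module):
    def __init__(self, in_dim, latent_channels=4):
        super().__init__()
        self.in_dim = in_dim
        # one strided 1D conv standing in for the real stack of conv layers
        self.conv = nn.Conv1d(1, latent_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, flat_params):
        # flat_params: [batch, in_dim], one flattened parameter vector per row
        assert flat_params.shape[1] == self.in_dim, "parameter vector length mismatch"
        return self.conv(flat_params.unsqueeze(1))  # latent of shape [batch, C, D]

enc = TinyParamEncoder(in_dim=307591)
z = enc(torch.randn(2, 307591))
print(z.shape)  # torch.Size([2, 4, 153796]): this (C, D) is what the UNet must expect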