clementchadebec / benchmark_VAE

Unifying Variational Autoencoder (VAE) implementations in Pytorch (NeurIPS 2022)
Apache License 2.0
1.79k stars · 162 forks

Customize models for different image sizes #91

Open 1180597-JoaoCampos opened 1 year ago

1180597-JoaoCampos commented 1 year ago

Hello, I was trying your library for my master's dissertation. My dataset contains 128x128 images, and I need to adapt the code to run some of your model architectures. I successfully did this with the VAE, but I'm having some issues adapting VAEGAN and RAE-L2 (mat1 and mat2 shapes cannot be multiplied (64x65536 and 1024x1)). Do you have any example of how to adapt the different models to the image size I'm using?

I would really like to run this benchmark on my project data and see the different results for each one. Thank you for your amazing work. Regards.

clementchadebec commented 1 year ago

Hi @1180597-JoaoCampos,

Thank you for your interest in the library. I am happy to see that it is useful for your research.

Just so I better understand what you did to make this work for the VAE: did you pass your own encoder and decoder to the model, or did you only rely on the networks built automatically?

For the VAEGAN model, this issue seems to come from the discriminator network. Did you try to provide a custom one?
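
For reference, a custom discriminator for 128x128 inputs can be written in the same spirit as a custom encoder. The sketch below only mirrors the ModelOutput convention used by the encoders discussed later in this thread; the exact base class and expected output keys should be checked against the discriminators provided in the library:

import torch
import torch.nn as nn
from pythae.models.base.base_utils import ModelOutput

class Discriminator128(nn.Module):
    """Rough sketch of a 128x128 discriminator (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, padding=1),     # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(128, 128, 4, 2, padding=1),  # 32 -> 16
            nn.ReLU(),
        )
        self.score = nn.Linear(128 * 16 * 16, 1)

    def forward(self, x):
        out = self.conv(x).reshape(x.shape[0], -1)
        # the final "embedding" holds the discriminator score in [0, 1]
        return ModelOutput(embedding=torch.sigmoid(self.score(out)))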

Best,

Clément

1180597-JoaoCampos commented 1 year ago

Due to the input difference, in my case 128x128 and yours 28x28 (I followed the MNIST example), I looked at the encoders and decoders in your library and modified some layers so that the autoencoder was prepared for my input.

For the VAE it worked fine; however, for the RAE-L2 and VAEGAN it didn't. Taking the RAE-L2 as an example:

import torch
import torch.nn as nn
from typing import List

from pythae.models.base.base_utils import ModelOutput
# ResBlock is the residual block used by the library's benchmark networks


class Encoder(nn.Module):

    def __init__(self, encoded_space_dim):
        super().__init__()
        self.input_dim = (1, 128, 128)  # here is one of the changes, you have (1, 28, 28)
        self.latent_dim = encoded_space_dim
        self.n_channels = 1

        layers = nn.ModuleList()

        layers.append(nn.Sequential(nn.Conv2d(self.n_channels, 64, 4, 2, padding=1)))

        layers.append(nn.Sequential(nn.Conv2d(64, 128, 4, 2, padding=1)))

        layers.append(nn.Sequential(nn.Conv2d(128, 128, 3, 2, padding=1)))

        layers.append(
            nn.Sequential(
                ResBlock(in_channels=128, out_channels=32),
                ResBlock(in_channels=128, out_channels=32),
            )
        )

        self.layers = layers
        self.depth = len(layers)

        self.embedding = nn.Linear(128 * 16 * 16, encoded_space_dim)  # here is another difference, you have 128 * 4 * 4

    def forward(self, x, output_layer_levels: List[int] = None):

        output = ModelOutput()

        max_depth = self.depth

        if output_layer_levels is not None:

            assert all(
                self.depth >= levels > 0 or levels == -1
                for levels in output_layer_levels
            ), (
                f"Cannot output layer deeper than depth ({self.depth})."
                f"Got ({output_layer_levels})."
            )

            if -1 in output_layer_levels:
                max_depth = self.depth
            else:
                max_depth = max(output_layer_levels)

        out = x

        for i in range(max_depth):

            out = self.layers[i](out)

            if output_layer_levels is not None:
                if i + 1 in output_layer_levels:
                    output[f"embedding_layer_{i+1}"] = out
            if i + 1 == self.depth:
                output["embedding"] = self.embedding(out.reshape(x.shape[0], -1))

        return output
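
As a quick sanity check of the 128 * 16 * 16 figure, you can push a dummy tensor through the three stride-2 convolutions (standalone snippet, not part of the model):

convs = nn.Sequential(
    nn.Conv2d(1, 64, 4, 2, padding=1),     # 128 -> 64
    nn.Conv2d(64, 128, 4, 2, padding=1),   # 64 -> 32
    nn.Conv2d(128, 128, 3, 2, padding=1),  # 32 -> 16
)
print(convs(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 128, 16, 16]), i.e. 128 * 16 * 16 = 32768 features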

the decoder:

class Decoder(nn.Module):
    def __init__(self, encoded_space_dim):
        super().__init__()

        self.input_dim = (1, 128, 128)  # here is the same difference as above, you have (1, 28, 28)
        self.latent_dim = encoded_space_dim
        self.n_channels = 1

        layers = nn.ModuleList()

        layers.append(nn.Linear(encoded_space_dim, 128 * 16 * 16))  # same as above, you had 128 * 4 * 4

        # This is the last modification: I added output_padding because without it this layer
        # outputs (batch_size, 128, 31, 31) and I need (batch_size, 128, 32, 32) so that the
        # final result is 128x128.
        layers.append(nn.ConvTranspose2d(128, 128, 3, 2, padding=1, output_padding=1))

        layers.append(
            nn.Sequential(
                ResBlock(in_channels=128, out_channels=32),
                ResBlock(in_channels=128, out_channels=32),
                nn.ReLU(),
            )
        )

        layers.append(
            nn.Sequential(
                nn.ConvTranspose2d(128, 64, 3, 2, padding=1, output_padding=1),
                nn.ReLU(),
            )
        )

        layers.append(
            nn.Sequential(
                nn.ConvTranspose2d(
                    64, self.n_channels, 3, 2, padding=1, output_padding=1
                ),
                nn.Sigmoid(),
            )
        )

        self.layers = layers
        self.depth = len(layers)

    def forward(self, x, output_layer_levels: List[int] = None):

        output = ModelOutput()

        max_depth = self.depth
        if output_layer_levels is not None:

            assert all(
                self.depth >= levels > 0 or levels == -1
                for levels in output_layer_levels
            ), (
                f"Cannot output layer deeper than depth ({self.depth})."
                f"Got ({output_layer_levels})"
            )

            if -1 in output_layer_levels:
                max_depth = self.depth
            else:
                max_depth = max(output_layer_levels)

        out = x

        for i in range(max_depth):
            out = self.layers[i](out)

            if i == 0:
                out = out.reshape(x.shape[0], 128, 16, 16)

            if output_layer_levels is not None:
                if i + 1 in output_layer_levels:
                    output[f"reconstruction_layer_{i+1}"] = out

            if i + 1 == self.depth:
                output["reconstruction"] = out

        return output
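
To illustrate the output_padding change mentioned above on the first transposed convolution (again a standalone snippet, not part of the model):

feat = torch.randn(1, 128, 16, 16)
print(nn.ConvTranspose2d(128, 128, 3, 2, padding=1)(feat).shape)                    # torch.Size([1, 128, 31, 31])
print(nn.ConvTranspose2d(128, 128, 3, 2, padding=1, output_padding=1)(feat).shape)  # torch.Size([1, 128, 32, 32])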

And the RAE-L2, which is exactly like yours:

class RAE_L2(nn.Module):

    def __init__(self, latent_dim=16):
        super(RAE_L2, self).__init__()
        self.latent_dim = latent_dim
        self.encoder = Encoder(latent_dim)
        self.decoder = Decoder(latent_dim)

        self.model_name = "RAE_L2"

    def forward(self, x, **kwargs):
        z = self.encoder(x).embedding
        recon_x = self.decoder(z)["reconstruction"]
        loss, recon_loss, embedding_loss = self.loss_function(recon_x, x, z)
        return recon_x, loss

    def loss_function(self, recon_x, x, z):

        recon_loss = torch.nn.functional.mse_loss(
            recon_x.reshape(x.shape[0], -1), x.reshape(x.shape[0], -1), reduction="none"
        ).sum(dim=-1)

        embedding_loss = 0.5 * torch.linalg.norm(z, dim=-1) ** 2

        return (
            (recon_loss + 1e-2 * embedding_loss).mean(
                dim=0
            ),
            (recon_loss).mean(dim=0),
            (embedding_loss).mean(dim=0),
        )
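
For completeness, a quick smoke test of this wrapper on a dummy batch (it assumes the Encoder and Decoder above are available, together with the library's ResBlock and ModelOutput):

model = RAE_L2(latent_dim=16)
dummy = torch.randn(4, 1, 128, 128)  # fake batch of four 128x128 grayscale images
recon_x, loss = model(dummy)
print(recon_x.shape)  # torch.Size([4, 1, 128, 128])
print(loss.item())    # scalar training loss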

And, by the way, this was the output I got:

[image: reconstructed samples, which all look empty]

Thank you for your help.

clementchadebec commented 1 year ago

Thank you for the clarification! I just tested and your architectures look fine.

When you say it didn't work for the RAE_L2 and VAEGAN, do you mean that you encountered an error while running the code, or that the results are not good? If you got an error, can you share it with me, please?

1180597-JoaoCampos commented 1 year ago

Sorry, I updated my question. When I say it doesn't work, I mean that when I plot the reconstructed images they all look empty.

clementchadebec commented 1 year ago

OK, thanks for clarifying. I am not that surprised that you are struggling to make it work with the VAEGAN, since the adversarial training makes it quite a tricky model to train. From my personal experience, the parameters adversarial_loss_scale and reconstruction_layer have a huge impact on the model's performance and depend on the neural network architectures you consider. I had to try several different settings to make it work on MNIST and CELEBA, as shown in the examples.
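
For illustration, both parameters can be set in the model configuration, roughly as follows (a sketch with placeholder values; encoder and decoder stand for your custom 128x128 networks):

from pythae.models import VAEGAN, VAEGANConfig

config = VAEGANConfig(
    input_dim=(1, 128, 128),
    latent_dim=16,
    adversarial_loss_scale=0.8,  # placeholder value, to be tuned
    reconstruction_layer=3,      # placeholder value, to be tuned
)
model = VAEGAN(config, encoder=encoder, decoder=decoder)  # a custom discriminator can also be provided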

However, this is a bit more unexpected for the RAE_L2, which is basically an autoencoder with L2 regularization on the latent codes.

Have you tried the following:

- reducing the factor in front of the embedding loss (or setting it to 0)?
- checking how the loss evolves during training?

1180597-JoaoCampos commented 1 year ago

I didn't try to reduce the factor before the embedding loss; can you give me an example of how to do that?

Regarding the evolution of the loss over the epochs, the value was quite high and seemed to have stagnated:

[image: training loss curve]

clementchadebec commented 1 year ago

It depends on which implementation you are using. If you use yours for the RAE_L2, the factor currently equals 1e-2 and can be set to zero as shown below:

return (
    (recon_loss + 1e-2 * embedding_loss).mean(dim=0),  # the 1e-2 here can be set to 0
    (recon_loss).mean(dim=0),
    (embedding_loss).mean(dim=0),
)

If you use pythae's implementation, you can do the following

from pythae.models import RAE_L2_Config, RAE_L2

config = RAE_L2_Config(
    input_dim=(1, 128, 128),
    embedding_weight=0,  # changes the factor before the embedding loss
    reg_weight=0,        # changes the weight decay for the decoder's optimizer
)

model = RAE_L2(config, encoder, decoder)
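
The configured model can then be trained with the library's pipeline as usual (sketch with placeholder dataset names):

from pythae.trainers import BaseTrainerConfig
from pythae.pipelines import TrainingPipeline

training_config = BaseTrainerConfig(
    output_dir="my_rae_l2",  # hypothetical output folder
    num_epochs=100,
    learning_rate=1e-3,
)

pipeline = TrainingPipeline(training_config=training_config, model=model)
pipeline(train_data=your_train_data, eval_data=your_eval_data)  # torch.Tensor or np.array of 128x128 images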

Just to be sure that the issue does not come from the architectures: can you confirm that the networks you use for the RAE_L2 are the same as those used for the VAE that actually works?