kebijuelun opened 9 months ago
I've noticed the same issue. It appears that the author has simply reused the encoder and decoder network architecture from movqgan and retrained the model as a plain VAE. I'm not sure whether my understanding is correct, so I would appreciate clarification from the author.
After training, each token of the encoder's output is very close to a particular vector in the codebook. I tried adding the VQ step back to refine the encoder's output and found that the MSE between the features before and after VQ is small, so decoding with and without VQ produces nearly the same results.
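The check I ran can be sketched roughly as follows. This is a minimal NumPy illustration of nearest-neighbor VQ and the pre-/post-VQ MSE, not the actual Kandinsky/MoVQ code; `vq_lookup` and the toy data are hypothetical:

```python
import numpy as np

def vq_lookup(z, codebook):
    """Nearest-neighbor vector quantization.

    z: (N, D) encoder features; codebook: (K, D) learned codes.
    Returns the quantized features and the chosen codebook indices.
    """
    # Squared L2 distance between every feature and every code: shape (N, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

# Toy setup mirroring the observation above: if the encoder output already
# sits very close to codebook entries, the pre-/post-VQ MSE is tiny, so
# decoding with or without VQ should give nearly identical results.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
z = codebook[rng.integers(0, 16, size=8)] + 1e-3 * rng.normal(size=(8, 4))

z_q, idx = vq_lookup(z, codebook)
mse = np.mean((z - z_q) ** 2)
print(mse)  # small, since z was constructed near codebook entries
```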
I have some questions regarding the implementation of MoVQ and would appreciate your clarification.
The original MoVQ paper mentions that a multi-channel VQ is adopted.
However, the implementation of kandinsky3 does not involve any vector quantization operation.
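For concreteness, my reading of the paper's multi-channel quantization is roughly the following: split the feature channels into groups and quantize each group independently by nearest neighbor against a shared codebook. This is only a sketch of my understanding (`multichannel_vq` is a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def multichannel_vq(z, codebook, groups=4):
    """Multi-channel VQ sketch: split the channel dim into `groups` chunks
    and quantize each chunk independently against a shared codebook.

    z: (N, D) features with D divisible by `groups`;
    codebook: (K, D // groups) shared codes for every chunk.
    """
    N, D = z.shape
    assert D % groups == 0
    d = D // groups
    chunks = z.reshape(N * groups, d)
    # Nearest codebook entry per chunk: distances have shape (N*groups, K)
    dists = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    z_q = codebook[dists.argmin(axis=1)]
    return z_q.reshape(N, D)

# Usage with made-up shapes: 8 channels split into 4 groups of 2.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(32, 2))
z = rng.normal(size=(5, 8))
z_q = multichannel_vq(z, codebook, groups=4)
print(z_q.shape)  # (5, 8)
```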
May I ask whether this is a misunderstanding of MoVQ on my part, or whether Kandinsky has modified the MoVQ implementation?