TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Apache License 2.0

Why not use `Adaptive GroupNorm` On Decoder & Sampler asymmetrical? #26

Closed xesdiny closed 3 months ago

xesdiny commented 3 months ago
  1. Does the GroupNorm-shortcut LFQ-embedding trick (Adaptive GroupNorm) not work well? (screenshot: screenshot-20240716-150109)
  2. In `class Encoder` & `class Decoder`, the samplers:

```python
# Encoder
# ...
for i_level in range(self.num_blocks):
    block = nn.ModuleList()
    block_in = ch * in_ch_mult[i_level]  # [1, 1, 2, 2, 4]
    block_out = ch * ch_mult[i_level]    # [1, 2, 2, 4]
    for _ in range(self.num_res_blocks):
        block.append(ResBlock(block_in, block_out))
        block_in = block_out

    down = nn.Module()
    down.block = block
    if i_level < self.num_blocks - 1:  # no downsample on the last level
        down.downsample = nn.Conv2d(block_out, block_out, kernel_size=(3, 3), stride=(2, 2), padding=1)

    self.down.append(down)

# Decoder
# ...
for i_level in reversed(range(self.num_blocks)):
    block = nn.ModuleList()
    block_out = ch * ch_mult[i_level]
    for i_block in range(self.num_res_blocks):
        block.append(ResBlock(block_in, block_out))
        block_in = block_out

    up = nn.Module()
    up.block = block
    if i_level > 0:  # no upsample on level 0
        up.upsample = Upsampler(block_in)
    self.up.insert(0, up)
```

This means:

```
Down    : Y   Y   Y   Y   N
BlockNum: 0   1   2   3   4
Up      : Y   Y   Y   Y   N
BlockNum: 4   3   2   1   0
```

So aren't the encoder's downsampling and the decoder's upsampling asymmetric, i.e. not applied at the same channel dimensions?

RobertLuo1 commented 3 months ago

Hi, thanks for your interest in our work. For the first question: in the currently released version we use a relatively small model to validate our training recipe, and, following the ablation studies in MAGVIT-2, we temporarily leave out AdaGN. In our current version we are actively training the full-gear generator, which is well aligned with the MAGVIT-2 paper. Stay tuned for the next big update! I do not quite understand the second question. Could you make it more specific?
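For readers unfamiliar with the AdaGN trick being discussed, a minimal sketch follows: the quantized (e.g. LFQ) embedding is projected to per-channel scale and shift that modulate a parameter-free GroupNorm. All names here are illustrative assumptions, not the Open-MAGVIT2 implementation.

```python
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    """Hypothetical sketch of Adaptive GroupNorm: a GroupNorm whose
    affine parameters are predicted from a conditioning embedding."""

    def __init__(self, num_groups, num_channels, cond_dim):
        super().__init__()
        # affine=False: the scale/shift come from the condition instead
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        self.to_scale = nn.Linear(cond_dim, num_channels)
        self.to_shift = nn.Linear(cond_dim, num_channels)

    def forward(self, x, cond):
        # x: (B, C, H, W); cond: (B, cond_dim), e.g. a pooled LFQ embedding
        h = self.norm(x)
        scale = self.to_scale(cond)[:, :, None, None]
        shift = self.to_shift(cond)[:, :, None, None]
        return h * (1 + scale) + shift

x = torch.randn(2, 64, 8, 8)
cond = torch.randn(2, 18)  # e.g. an 18-dim LFQ embedding (2^18 codebook)
out = AdaptiveGroupNorm(8, 64, 18)(x, cond)
print(out.shape)  # torch.Size([2, 64, 8, 8])
```

The `1 + scale` parameterization keeps the layer close to identity at initialization, which is a common stabilizing choice for conditioned normalization.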

xesdiny commented 3 months ago

> Hi, Thanks for your interest in our work. For the first question, in the current released version, we use a relatively small model to validate our training recipe. Moreover, according to the ablation studies provided in MAGVIT-2, we temporarily ablate the ADAGN. Actually, In our current version, we are actively training the full-gear generator which is well aligned with MAGVIT-2 Paper. Stay tuned for the next huge update! I do not understand the second question quite well. Can you make it more specific?

First of all, thank you for your answer to the first question. Regarding the second question: the whole codec is built from 5 levels of ResNet blocks, each with or without a down/up sampler. The encoder downsamples by 2 at each of levels 0 to 3, for a total 16x compression in height and width, while level 4, the level with the highest channel dimension, is not downsampled. The decoder, in reverse, upsamples from level 4 down to level 1, and the final level 0 is not upsampled. The result is a codec whose sampler placement is asymmetric, which seems inconsistent with the 4x8x8 compression structure in the MAGVIT-2 appendix. From that appendix figure, the 4th layer of the decoder, the one closest to the discrete part, does not have a T-Causal upsampling operation.
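The schedule described above can be traced in pure Python by replaying just the two `if` conditions from the snippet (assuming a 256x256 input and `num_blocks = 5`; this is a hand trace, not the actual model code):

```python
NUM_BLOCKS = 5

def encoder_schedule(res, num_blocks=NUM_BLOCKS):
    """Spatial size after each encoder level (downsample when i < last)."""
    sizes = []
    for i_level in range(num_blocks):
        if i_level < num_blocks - 1:
            res //= 2
        sizes.append(res)
    return sizes

def decoder_schedule(res, num_blocks=NUM_BLOCKS):
    """Spatial size after each decoder level (upsample when i > 0)."""
    sizes = []
    for i_level in reversed(range(num_blocks)):
        if i_level > 0:
            res *= 2
        sizes.append(res)
    return sizes

enc = encoder_schedule(256)
dec = decoder_schedule(enc[-1])
print(enc)  # [128, 64, 32, 16, 16] -- level 4 keeps 16x16
print(dec)  # [32, 64, 128, 256, 256] -- level 0 keeps 256x256
```

So the overall compression is symmetric (16x down, 16x up), but the sampler-free level is level 4 in the encoder and level 0 in the decoder, meaning each resize happens at a different channel width, which is the asymmetry being asked about.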

xesdiny commented 3 months ago

By the way, how is the conditional discriminator designed? I see that `NLayerDiscriminator` has no implementation for feeding `cond` into `forward`.
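For context, the most common way to condition a PatchGAN-style `NLayerDiscriminator` (as in pix2pix) is to concatenate the condition with the input along the channel axis before the first convolution. The sketch below shows only that generic pattern; it is an assumption, not the Open-MAGVIT2 implementation.

```python
import torch
import torch.nn as nn

class CondPatchDiscriminator(nn.Module):
    """Hypothetical conditional PatchGAN: cond is channel-concatenated
    with the input, so the first conv sees in_ch + cond_ch channels."""

    def __init__(self, in_ch=3, cond_ch=3, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + cond_ch, ndf, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, 1, 4, stride=1, padding=1),  # patch logits
        )

    def forward(self, x, cond):
        # cond must share spatial size with x for the concatenation
        return self.net(torch.cat([x, cond], dim=1))

d = CondPatchDiscriminator()
logits = d(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 1, 15, 15]) -- one logit per patch
```

Per-patch logits (rather than a single scalar) are what makes this a PatchGAN; each spatial position judges the realism of one receptive-field patch.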

shinshiner commented 3 months ago

> Hi, Thanks for your interest in our work. For the first question, in the current released version, we use a relatively small model to validate our training recipe. Moreover, according to the ablation studies provided in MAGVIT-2, we temporarily ablate the ADAGN. Actually, In our current version, we are actively training the full-gear generator which is well aligned with MAGVIT-2 Paper. Stay tuned for the next huge update! I do not understand the second question quite well. Can you make it more specific?

@RobertLuo1 Is the full-gear generator still for image tokenization, or will you also reproduce the video tokenizer?

RobertLuo1 commented 3 months ago

@xesdiny Hi, I think when you dive a little deeper into the code, you will find the different initialization of the ResBlocks in the encoder and decoder; see https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L155 and https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L88. Note that the final layer is not downsampled; the code is shown here https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L93

RobertLuo1 commented 3 months ago

@shinshiner Hi, currently we are still working on image tokenization and the subsequent autoregressive generation. Later we will continue upgrading the tokenizer to video.