Closed xesdiny closed 3 months ago
Hi, thanks for your interest in our work. For the first question: in the currently released version we use a relatively small model to validate our training recipe. Moreover, following the ablation studies in MAGVIT-2, we have temporarily ablated the AdaGN. In fact, we are actively training the full-gear generator, which is well aligned with the MAGVIT-2 paper. Stay tuned for the next big update! I don't quite understand the second question. Can you make it more specific?
First of all, thank you for your answer to the first question. Regarding the second one: the entire codec is built from 5 layers of ResNet blocks with optional down/up samplers. The encoder downsamples at each of layers 0 through 3, for a total of 16x compression in height and width, while layer 4 (the highest-dimensional layer) is not downsampled. The decoder, however, upsamples in reverse from layer 4 down to layer 1, and the final layer 0 is not upsampled. The result is a codec whose compression is asymmetric across layers, which is inconsistent with the 4x8x8 compression structure in the MAGVIT-2 appendix. From that appendix figure, the 4th layer of the decoder, the one closest to the discrete part, has no T-Causal upsampling operation.
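The mismatch above can be checked with quick arithmetic (the stage counts are my reading of the description, not the repo's actual config): each down/up-sampling stage changes H and W by a factor of 2, so 4 downsampling stages give 16x, while the 8x spatial factor in MAGVIT-2's 4x8x8 (T x H x W) ratio implies only 3 spatial stages.

```python
import math

# Assumption: layers 0..3 each downsample H and W once (4 stages total).
open_magvit2_stages = 4
print(2 ** open_magvit2_stages)  # 16x H/W compression

# The MAGVIT-2 appendix states a 4x8x8 (T x H x W) compression ratio.
magvit2_spatial = 8
print(int(math.log2(magvit2_spatial)))  # 3 spatial stages implied
```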
By the way, how is the conditional discriminator designed? I see that NLayerDiscriminator has no implementation for passing cond into forward.
@RobertLuo1 The full-gear generator is still within image tokenization? Or you will also reproduce the video tokenizer?
@xesdiny Hi, if you dive a little deeper into the code you will find the different initialization of the resblocks in the encoder and decoder; see https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L155 and https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L88. Note that the final layer is not downsampled; the code is shown here: https://github.com/TencentARC/Open-MAGVIT2/blob/6bfab153fb2fe612edbf8afa9b3a607d8337c194/taming/modules/diffusionmodules/improved_model.py#L93
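To make the layer-wise asymmetry concrete, here is a toy sketch of which channel dims carry a resampling op in loops like the linked encoder/decoder code. The base channel count and multipliers are illustrative assumptions, not the repo's actual config:

```python
# Illustrative config (assumed, not the repo's real values).
ch, ch_mult = 64, (1, 2, 2, 4, 4)
num_levels = len(ch_mult)

# Encoder: a downsample follows every level except the last (deepest) one.
enc = [(ch * ch_mult[i], i != num_levels - 1) for i in range(num_levels)]

# Decoder: levels are iterated in reverse; an upsample follows every level
# except level 0 (the shallowest).
dec = [(ch * ch_mult[i], i != 0) for i in reversed(range(num_levels))]

print(enc)  # downsampling at dims 64, 128, 128, 256; none at the deepest 256
print(dec)  # upsampling at dims 256, 256, 128, 128; none at the shallowest 64
```

The total spatial factor is the same in both directions (four 2x stages), but the resampling ops are attached to different channel dims, which is the asymmetry being discussed.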
@shinshiner Hi, currently we still work on image tokenization and the subsequent autoregressive generation. Later we will continue upgrading the tokenizer to video.
I mean: are the encoder's downsampling and the decoder's upsampling asymmetrical with respect to the dims at which they occur?