TencentARC / Open-MAGVIT2

Open-MAGVIT2: Democratizing Autoregressive Visual Generation

Reconstruction result #1

Closed · Epiphqny closed this 3 months ago

Epiphqny commented 3 months ago

Dear authors, thanks for your work. I tested the two models on my own pictures, but the results don't look good. Is there anything wrong? Thanks for your attention.

Here are the two original pictures:

[original images: 15_image_0_0_0, 16_image_0_0_0]

The reconstructions using IN128_Base:

[reconstructed images]

The reconstructions using IN256_Base:

[reconstructed images]

Also, as described in the README, the model achieves 0.39 rFID with 8x downsampling, but the provided model seems to use 4x downsampling? There is no 8x downsampling model in the links. Thanks for your help!
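One quick way to sanity-check which factor a checkpoint actually uses is to encode a dummy image and compare spatial shapes. A minimal sketch, assuming a hypothetical `tokenizer` object with an `encode` method returning a 2D token grid (these names are illustrative, not the actual Open-MAGVIT2 API):

```python
import torch

def effective_downsampling(tokenizer, size: int = 256) -> int:
    """Encode a dummy image and infer the spatial downsampling factor."""
    x = torch.randn(1, 3, size, size)   # dummy RGB batch
    with torch.no_grad():
        tokens = tokenizer.encode(x)    # assumed token shape: (B, H', W')
    return size // tokens.shape[-1]     # e.g. 256 // 32 == 8
```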

RobertLuo1 commented 3 months ago

Thanks for your interest in our work. To clarify: when training on $128 \times 128$ ImageNet, our model has three downsampling layers, so each image is encoded into a $16 \times 16$ token grid. The same $16 \times 16$ grid for quantization is used in VQGAN and MAGVIT. Here, as stated in the caption, we use the 8x downsampling model to test directly on $256 \times 256$ images without finetuning, which yields a $32 \times 32$ grid of 2D tokens, consistent with the paper's 8x downsampling ratio. Facial reconstruction is quite hard for a model trained only on ImageNet, but with more tokens the results improve, e.g., when testing the 8x downsampling model directly on $512 \times 512$ images.
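The grid sizes in that explanation follow directly from the downsampling factor; here is a minimal sketch of the arithmetic (the function name is illustrative, not part of the codebase):

```python
def token_grid(input_size: int, num_down_layers: int = 3) -> int:
    """Side length of the 2D token grid: each stride-2 layer halves the resolution."""
    factor = 2 ** num_down_layers   # 3 layers -> 8x downsampling
    return input_size // factor

assert token_grid(128) == 16   # training resolution -> 16x16 tokens
assert token_grid(256) == 32   # zero-shot 256x256  -> 32x32 tokens
assert token_grid(512) == 64   # zero-shot 512x512  -> 64x64 tokens
```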

Epiphqny commented 3 months ago

Thanks for your quick reply; I will close the issue.