Thanks for your interest in our work. We would like to clarify further: when training on $128 \times 128$ ImageNet, our model has three downsampling layers, so each image is processed into a $16 \times 16$ token grid; a $16 \times 16$ grid for quantization is also used in VQGAN and MAGVIT. Here, as stated in the caption, we take the 8x-downsampling model and test it directly on $256 \times 256$ images without finetuning, which yields $32 \times 32$ 2D tokens, matching the papers that adopt an 8x downsampling ratio. Facial reconstruction is quite hard when the model is trained only on ImageNet, but results improve as the number of tokens grows, e.g., when testing the 8x-downsampling model directly on $512 \times 512$ images.
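For reference, a minimal sketch of the token-grid arithmetic described above (the function name and structure are illustrative only, not from the released code):

```python
def token_grid_side(image_size: int, num_downsample_layers: int = 3) -> int:
    """Side length of the 2D token grid after downsampling.

    Each downsampling layer halves the spatial resolution, so three
    layers give an 8x downsampling factor overall.
    """
    factor = 2 ** num_downsample_layers
    assert image_size % factor == 0, "resolution must be divisible by the factor"
    return image_size // factor

for size in (128, 256, 512):
    side = token_grid_side(size)
    print(f"{size} -> {side} x {side} tokens")
# 128 -> 16 x 16 tokens  (training resolution)
# 256 -> 32 x 32 tokens  (direct test, no finetuning)
# 512 -> 64 x 64 tokens  (direct test, more tokens, better reconstruction)
```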
Thanks for your quick reply; I will close the issue.
Dear authors, thanks for your work. I tested the two models on my own pictures, but the results do not look good. Is there anything wrong? Thanks for your attention. Here are the two original pictures, followed by the reconstruction using IN128_Base and the reconstruction using IN256_Base:
Also, as described in the README, the model achieves 0.39 rFID with 8x downsampling, but the provided model seems to be the 4x-downsampling one? There is no 8x-downsampling model in the links. Thanks for your help!