CompVis / taming-transformers

Taming Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2012.09841
MIT License

Implementation question - Resize then crop - Model trained on low resolution image? #214


HieuPhan33 commented 1 year ago

Hi, from the config file sflckr_cond_stage.yaml, the image is resized with SmallestMaxSize=256 and then cropped, so the model was trained on smaller (resized) images. Was the released model checkpoint also trained this way?

When running inference on a high-resolution image, VQGAN takes crops of the high-resolution input and performs sliding-window inference. There is therefore a slight inconsistency between the training inputs (low-resolution resized images) and the inference inputs (crops of full-resolution images). Could you please confirm whether my understanding is correct?
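For reference, here is a minimal sketch of the resize-then-crop preprocessing I mean, assuming the albumentations-based pipeline used by the repo's data loaders (the dummy image is just for illustration):

```python
import albumentations
import numpy as np

size = 256  # SmallestMaxSize target from sflckr_cond_stage.yaml

# Rescale so the smallest side becomes 256, then take a 256x256 crop.
preprocessor = albumentations.Compose([
    albumentations.SmallestMaxSize(max_size=size),
    albumentations.RandomCrop(height=size, width=size),
])

image = np.random.randint(0, 256, (1024, 1536, 3), dtype=np.uint8)  # dummy high-res input
out = preprocessor(image=image)["image"]
print(out.shape)  # (256, 256, 3) -- the model only ever sees downscaled 256x256 crops
```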

kurlapov commented 1 year ago

Based on the information you provided, the model was trained on images resized so that the smallest side is 256 pixels and then cropped, while at inference it is fed crops of high-resolution images. So yes, there is an inconsistency between the training images and the images processed at inference time. Your understanding is correct, and I'll explain in more detail below.

During training, the data pipeline resizes each image so that its smallest side is 256 pixels and then crops it, so the model only ever sees fixed-size, downscaled patches. The cropping also acts as data augmentation, letting the model learn from different parts of each image.

During inference, however, when you input a high-resolution image, the model takes crops (patches) of the input using a sliding window, covering the whole image and generating the output piece by piece. This lets it handle larger images without running into memory constraints.
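To make the sliding-window idea concrete, here is a minimal pixel-space sketch of the general technique. This is not the repo's actual sampling code (which operates on latent codes); `model` is assumed to map a 256x256 crop to a same-sized output:

```python
import torch

@torch.no_grad()
def sliding_window_infer(model, image, window=256, stride=192):
    # image: (1, C, H, W) tensor with H, W >= window.
    _, _, h, w = image.shape
    out = torch.zeros_like(image)
    weight = torch.zeros(1, 1, h, w, device=image.device)
    ys = list(range(0, h - window + 1, stride))
    xs = list(range(0, w - window + 1, stride))
    # Ensure the last row/column of windows reaches the image border.
    if ys[-1] != h - window:
        ys.append(h - window)
    if xs[-1] != w - window:
        xs.append(w - window)
    for y in ys:
        for x in xs:
            crop = image[:, :, y:y + window, x:x + window]
            out[:, :, y:y + window, x:x + window] += model(crop)
            weight[:, :, y:y + window, x:x + window] += 1.0
    return out / weight  # average predictions where windows overlap
```

Overlapping windows (stride < window) with averaging reduce visible seams at patch borders, but each crop is still a full-resolution patch rather than the downscaled views seen during training.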

As you correctly pointed out, this creates a train/inference mismatch: the training crops come from downscaled images, while the inference crops come from full-resolution ones, so local statistics such as texture scale differ. This mismatch can lead to differences between the generated high-resolution outputs and what the model saw during training, e.g. a loss of fine detail or artifacts.

It's important to keep the model's limitations in mind: it may not perform optimally on high-resolution images because it was primarily trained on smaller, cropped data. If you need it to handle high-resolution images better, consider retraining on high-resolution data or using an approach designed for higher-resolution inputs.

Marcelo5444 commented 1 year ago

Hi, has the model been trained on images in [0, 1] or in [0, 255]? Looking at the transformations, it looks like the latter.
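One quick way to check empirically is to inspect a sample from the dataset the config builds. A sketch, where `dataset` stands for whichever of the repo's data classes you instantiated:

```python
import numpy as np

example = dataset[0]          # a dict; the repo's datasets expose an "image" key
img = np.asarray(example["image"])
print(img.dtype, img.min(), img.max())
# roughly [0, 255] -> raw pixel values
# roughly [0, 1]   -> scaled floats
# roughly [-1, 1]  -> the (x / 127.5 - 1.0) normalization used in taming/data/base.py
```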