HaozheZhao / UltraEdit


Can we use image input with different resolutions? #11

Closed · tageao460 closed this 1 month ago

tageao460 commented 1 month ago

First of all, this project works very well with 512x512 images, great work!

However, when I tried to load an image at 1024x1024 resolution, the result was terrible.

As the paper mentions, the model is trained at 256 × 256 and 512 × 512 for generation. This configuration is reasonable for SD 1.5, but isn't it a bit small for SDXL and SD3?

The input image is as follows, and the prompt is "Please add some apples on the table": [input image]

The result at 512x512 resolution: [image]

The result at 1024x1024 resolution: [image]

HaozheZhao commented 1 month ago

Thank you for your feedback!

For a fair comparison, our model was trained at 256 × 256 and 512 × 512 image resolutions for SD 1.5, consistent with its counterparts. We report these results in the paper to highlight the advantages of our dataset.

However, our recent tests and experiments show that SDXL and SD3 do require higher resolutions. Specifically, due to its DiT architecture, SD3 performs poorly when the input resolution differs from the one it was trained on; it does not generalize well to generating images at other resolutions. A potential solution is therefore to retrain the model on 1024 × 1024 images for SD3. The demo we shared uses SD3 trained on the UltraEdit dataset at 512 × 512 resolution.
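In the meantime, a practical workaround for the SD3 checkpoint is to downscale the input to the trained 512 × 512 resolution before editing and upscale the result afterwards. Below is a minimal sketch assuming a diffusers-style editing pipeline; the checkpoint path, file names, and sampler settings are placeholders for illustration, not our released API:

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline

# Placeholder checkpoint name; substitute the SD3-based UltraEdit
# checkpoint referenced in the repo's README.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/ultraedit-sd3-512",
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("table.png").convert("RGB")  # e.g. a 1024x1024 input
orig_size = source.size

# SD3 was trained at 512 x 512, so resize the input to match
# before editing, then upscale the result back afterwards.
edited = pipe(
    prompt="Please add some apples on the table",
    image=source.resize((512, 512), Image.LANCZOS),
    num_inference_steps=50,
).images[0]

edited = edited.resize(orig_size, Image.LANCZOS)
edited.save("edited.png")
```

The upscaling step trades some sharpness for correctness; until a 1024 × 1024 SD3 checkpoint is available, editing at the trained resolution avoids the degraded outputs shown above.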

In contrast, SDXL generalizes well to different resolutions. We trained it at 512 × 512 resolution, and it still produces good results at inference with 1024 × 1024 images.
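So with the SDXL variant, the same input can be edited directly at 1024 × 1024 without any resizing. A minimal sketch along the same lines (again, the checkpoint name is a placeholder):

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline

# Placeholder checkpoint name for the SDXL-based UltraEdit model.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/ultraedit-sdxl-512",
    torch_dtype=torch.float16,
).to("cuda")

# SDXL generalizes across resolutions, so a 1024 x 1024 input
# can be edited directly, no downscaling to 512 x 512 needed.
source = Image.open("table_1024.png").convert("RGB")
edited = pipe(
    prompt="Please add some apples on the table",
    image=source,
    num_inference_steps=50,
).images[0]
edited.save("edited_1024.png")
```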