Closed · luohao123 closed this issue 2 months ago
From the method described in the paper, if the output resolution is huge, such as 1024x2048, the actual generation time would be much larger than that of a diffusion model.
So, for large-image generation, what is the actual strength of this method?
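The cost concern can be made concrete with a back-of-envelope sketch. The script below is illustrative only: the scale schedules and the 16x VAE downsampling factor are assumptions for the example, not figures confirmed in this thread. It shows how the total token count of next-scale prediction grows roughly quadratically with image side length, while the model only needs one forward pass per scale (tokens within a scale are decoded in parallel), versus a diffusion sampler's fixed number of full-image denoising passes.

```python
# Back-of-envelope cost comparison (a sketch; the scale schedules below
# are illustrative assumptions, not taken from the paper verbatim).

def var_cost(scales):
    """Return (total_tokens, forward_passes) for next-scale prediction:
    one token per latent position at every scale, one forward pass per
    scale since tokens within a scale are emitted in parallel."""
    total_tokens = sum(h * w for h, w in scales)
    return total_tokens, len(scales)

# Hypothetical schedule for a 16x16 latent grid
# (a 256x256 image with an assumed 16x-downsampling VAE):
scales_256 = [(s, s) for s in (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)]

# Doubling the output resolution quadruples the finest grid, so token
# count grows ~quadratically with side length, but the number of
# forward passes stays about the same:
scales_512 = [(s, s) for s in (1, 2, 3, 4, 6, 9, 13, 18, 24, 32)]

print(var_cost(scales_256))  # (680, 10)  tokens, passes at 256x256
print(var_cost(scales_512))  # (2240, 10) tokens, passes at 512x512

# A diffusion sampler, by contrast, runs a fixed number of full-image
# denoising passes (e.g. 25-50) regardless of resolution -- though the
# per-pass cost still grows with the latent size.
```

So the number of sequential steps stays small even at high resolution; what grows is the per-step attention cost over the larger token grid, which is a different trade-off than simply "much slower than diffusion."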
@luohao123
It's not actually the case for a couple of reasons:
Thanks for the reply. The demo images are not high-resolution enough (for instance, 1024x1024), and in quality there is actually still a gap compared to SOTA diffusion models.
What could be the reason for this?
@luohao123
This is a class-conditioned image generation framework. We conducted experiments on ImageNet at resolutions of 256x256 and 512x512, so comparing these class-conditioned results at lower resolutions with those of text-to-image diffusion models at 1024x1024 is clearly unfair. Also, are you referring to text-to-image generation? We will release work in that area soon.
Yes, I'm talking about text-to-image; it looks like the current work is class-conditioned image generation.
Is text-to-image with next-scale prediction comparable with diffusion models such as Stable Diffusion on large images?
@luohao123
Under the same conditions of data quality and training epochs, VAR can match or even exceed the performance of diffusion models.
Additionally, it is worth noting that our method has better scalability: as the model scales up to 10B or 20B parameters, its potential will be even greater.
Really looking forward to your text-to-image work coming out!