Closed — jpmcarvalho closed this issue 6 months ago
Hello,
Clamping is essential when we are dealing with pixel space, but not really needed in latent space. When we decode a generated latent sample, we want the resulting image to lie in the valid pixel range, hence the clamping; in latent space there is no such requirement. That said, while it is not essential, one could argue that bounding the latents to (-1, 1) with something like a tanh would be better for the diffusion model than the current unbounded setup. However, from what I could tell from the code of the official stable diffusion repo, they go with unbounded latent outputs - https://github.com/CompVis/stable-diffusion/blob/21f890f9da3cfbeaba8e2ac3c425ee9e998d5229/scripts/txt2img.py#L314 - and clamp only the pixel-space generated image.
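A minimal sketch of the distinction, using NumPy's `clip` as a stand-in for `torch.clamp` (the `decode` function, shapes, and ranges here are illustrative assumptions, not the repo's actual API):

```python
import numpy as np

def decode(latents):
    # Hypothetical decoder stand-in: any mapping from latents to pixels.
    # A real VQ-VAE decoder can produce values slightly outside [-1, 1].
    return 1.2 * np.tanh(latents)

# Latents stay unbounded: no clamp before saving them or feeding the diffusion model.
latents = np.random.randn(4, 8, 8) * 3.0   # values well outside [-1, 1] are fine here

# Pixel space: clamp the decoded image so it maps to a valid display range.
pixels = np.clip(decode(latents), -1.0, 1.0)
image = ((pixels + 1.0) / 2.0 * 255.0).astype(np.uint8)  # rescale [-1, 1] -> [0, 255]

# Optional alternative discussed above: bound the latents with tanh instead of clamping.
bounded_latents = np.tanh(latents)   # smooth, differentiable squashing into (-1, 1)
```

Note that `tanh` is differentiable everywhere, so it could be applied during training, whereas a hard clamp on latents would zero gradients outside the bound.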
Thank you!
Hello, when you run infer_vqvae.py you save the latent information (the encoder output) but do not clamp it (torch.clamp(encoded_output, -1., 1.)).
I also checked the dataset loading path: when use_latents is True, the latents are not clamped there either.
Maybe it's a bug?
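For reference, the clamp I would have expected looks like this (a sketch with NumPy's `clip` standing in for `torch.clamp`; the variable name and shape are hypothetical, taken only from the call mentioned above):

```python
import numpy as np

# Stand-in for the encoder output that infer_vqvae.py saves (hypothetical shape).
encoded_output = np.random.randn(1, 4, 16, 16) * 2.5

# The clamp in question: torch.clamp(encoded_output, -1., 1.) written via np.clip.
clamped = np.clip(encoded_output, -1.0, 1.0)
```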
Thank you!