[Open] k-zha14 opened this issue 2 weeks ago
I think the devs are working on a similar new VAE. We'll see.
Have you changed the vae's scale_factor accordingly?
Absolutely! Specifically, SDv3's VAE adds a `shift_factor`, so the normalization of the VAE's latents becomes:

```python
latents = (latents - self.config.shift_factor) * self.config.scaling_factor
```
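As a quick sanity check, here is a minimal sketch of that shift-and-scale normalization and its inverse. The numeric values are assumptions taken from the published SD3 VAE config; verify them against your checkpoint's `config.json`:

```python
import numpy as np

# Assumed SD3 VAE config values -- confirm against your checkpoint.
SCALING_FACTOR = 1.5305
SHIFT_FACTOR = 0.0609

def normalize(latents):
    # Applied after vae.encode(), before the diffusion backbone.
    return (latents - SHIFT_FACTOR) * SCALING_FACTOR

def denormalize(latents):
    # Inverse, applied before vae.decode() at sampling time.
    return latents / SCALING_FACTOR + SHIFT_FACTOR

x = np.random.randn(1, 16, 32, 32).astype(np.float32)
assert np.allclose(denormalize(normalize(x)), x, atol=1e-5)
```

If training looks fine but samples are noise, a mismatched (or missing) `shift_factor` between training and sampling is a common culprit.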
Waiting for more progress! Let's work together to make the PixArt community more prosperous. CC @ReyJ94 @lawrence-cj
Hi, great work! I succeeded in reproducing the VAE adaptation from SD2's to SDXL's, as discussed in PixArt-Sigma. However, the adaptation to SDv3's VAE is not successful: after 10k steps of finetuning on SAM, the sampled images are meaningless and chaotic (attached below), although the training loss looks pretty good.
The key change in SDv3's VAE is that the latent channels expand from 4 to 16, so the compressed latents preserve more detail and avoid unpleasant artifacts (e.g., small faces, text). To accommodate this change, I initialize the network with the official 'Pixart-alpha-256x256.pt' weights, except for the 'x_embed' layer and 'final_layer' (channels 4 -> 16, 8 -> 32).
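For reference, one common way to widen those two layers is to copy the pretrained weights into the first 4 input channels and zero-init the new ones. This is only an illustration: the layer names follow the comment above, and the shapes (`embed_dim=1152`, `patch=2`) are assumptions, not the actual PixArt checkpoint layout:

```python
import numpy as np

embed_dim, patch = 1152, 2  # assumed hidden size and patch size

# x_embed: patch-embedding conv widened from 4 to 16 input channels.
old_w = np.random.randn(embed_dim, 4, patch, patch).astype(np.float32)
new_w = np.zeros((embed_dim, 16, patch, patch), dtype=np.float32)
new_w[:, :4] = old_w  # reuse pretrained channels; the extra 12 start at zero

# final_layer: output channels grow 8 -> 32 (per-patch mean + variance
# for 16 latent channels instead of 4).
old_out = np.random.randn(8, embed_dim).astype(np.float32)
new_out = np.zeros((32, embed_dim), dtype=np.float32)
new_out[:8] = old_out
```

A zero-init of the new slices keeps the network's initial behavior close to the pretrained 4-channel model; random-initializing them instead can destabilize early finetuning, which might contribute to the chaotic samples you are seeing.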
Could anyone give me some hints? I'm really confused. Thanks, guys!