PixArt-alpha / PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
https://pixart-alpha.github.io/PixArt-sigma-project/
GNU Affero General Public License v3.0

Adapt to SD-v3's VAE #114

Open k-zha14 opened 2 weeks ago

k-zha14 commented 2 weeks ago

Hi, great work! I succeeded in reproducing the VAE adaptation from SD2's VAE to SDXL's, as discussed in the PixArt-Σ paper. However, the adaptation to SDv3's VAE is not successful: after 10k steps of fine-tuning on SAM, the sampled images are meaningless and chaotic (attached below), even though the training loss looks fine.

(Attached images: the first two are generated from the SD2-to-SDXL adaptation experiment, while the latter are from the SD2-to-SDv3 VAE adaptation.)

The key change in SDv3's VAE is that the latent channel count expands from 4 to 16, so the compressed latents preserve more detail and avoid unpleasant artifacts (e.g. on small faces and text). To accommodate this change, I initialize the network with the official 'Pixart-alpha-256x256.pt' weights, except for the 'x_embed' layer and 'final_layer' (input channels 4 -> 16, output channels 8 -> 32).
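The partial initialization above can be sketched as follows. This is a minimal sketch, not the repo's actual loading code: it keeps every pretrained tensor except those in the layers whose shapes depend on the latent channel count, then loads the rest non-strictly. The prefix names ('x_embedder.', 'final_layer.') follow the DiT/PixArt key convention and are assumptions; match them to your checkpoint's real key names.

```python
def filter_for_new_latent_channels(state_dict,
                                   prefixes=("x_embedder.", "final_layer.")):
    """Keep every pretrained tensor except those belonging to layers that
    must be re-initialized for the new latent channel count (4 -> 16 in,
    8 -> 32 out). Those dropped layers fall back to fresh initialization."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(prefixes)}

# Usage sketch (PyTorch; paths and key names are illustrative):
#   ckpt = torch.load("PixArt-alpha-256x256.pt", map_location="cpu")
#   sd = ckpt.get("state_dict", ckpt)
#   model.load_state_dict(filter_for_new_latent_channels(sd), strict=False)
```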

Could anyone give me some hints? I'm really confused. Thanks, guys!

ReyJ94 commented 1 week ago

I think the devs are working on a new, similar VAE. We'll see.

lawrence-cj commented 1 week ago

Have you changed the vae's scale_factor accordingly?

k-zha14 commented 1 week ago

Have you changed the vae's scale_factor accordingly?

Absolutely! Specifically, SDv3's VAE adds a 'shift_factor' on top of the usual 'scaling_factor', so the normalization of the VAE's latents becomes:

latents = (latents - self.config.shift_factor) * self.config.scaling_factor
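For reference, a hedged sketch of that normalization and its inverse (applied before decoding). The default values below are the SDv3 VAE config values as published in diffusers; treat them as illustrative and read the real ones from `vae.config.shift_factor` / `vae.config.scaling_factor` in practice:

```python
def normalize_latents(latents, shift_factor=0.0609, scaling_factor=1.5305):
    """Map raw VAE posterior samples into the diffusion model's latent space.
    Defaults are assumed SDv3 VAE config values; read them from vae.config."""
    return (latents - shift_factor) * scaling_factor

def denormalize_latents(latents, shift_factor=0.0609, scaling_factor=1.5305):
    """Inverse mapping, applied to model outputs before vae.decode()."""
    return latents / scaling_factor + shift_factor
```

If only the scaling is applied and the shift is skipped (or vice versa), the latent distribution the transformer sees is offset from what the VAE decoder expects, which can produce exactly the chaotic samples described above despite a healthy-looking training loss.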

Waiting for more progress! Let's work together to make the PixArt community more prosperous. CC @ReyJ94 @lawrence-cj