piddnad closed this issue 4 months ago.
I'm wondering: was this done by loading the PixArt-alpha pretrained weights from the high-aesthetics stage, and then using the 33M internal-sigma data for adaptation? Is my understanding correct?
That's correct. Please share some of your training results here.
Here are some of my results. The generated images show some blockiness and blurring:
prompt: A lovely young lady, with a smile on her face...
prompt: city skyline at night...
Which training and test code are you using?
I'm using training and validation code based on train_diffusers.py from PixArt-alpha.
Check your VAE and scale_factor. BTW, training with diffusers is not stable; that's why we haven't migrated the whole code base to diffusers.
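To make the scale_factor pitfall concrete, here is a minimal, self-contained sketch (no real checkpoints; the scaling values are the documented config values for the two VAEs, and the list-based encode/decode stands in for the actual tensor math). The point is that the factor applied after encoding must match the one divided out before decoding, and it changes when the VAE is swapped:

```python
# Documented scaling_factor values from the two VAE configs.
SD15_SCALE = 0.18215  # SD1.5 VAE
SDXL_SCALE = 0.13025  # SDXL VAE

def scale_latents(latents, scaling_factor):
    # diffusers convention: multiply by scaling_factor after encode ...
    return [x * scaling_factor for x in latents]

def unscale_latents(latents, scaling_factor):
    # ... and divide by it before decode.
    return [x / scaling_factor for x in latents]

latents = [1.0, -0.5, 2.0]

# Correct round trip: the same factor both ways recovers the input.
roundtrip = unscale_latents(scale_latents(latents, SDXL_SCALE), SDXL_SCALE)

# Mismatch: the SD1.5 factor left in a training script after swapping in
# the SDXL VAE silently rescales every latent by ~1.4x.
mismatch = unscale_latents(scale_latents(latents, SD15_SCALE), SDXL_SCALE)
```

A mismatch like this typically shows up exactly as blurry or washed-out samples rather than an outright error, which is why it is worth checking explicitly.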
Hello, I have carefully reviewed the VAE code and used the original training code, but the results have not changed.
However, I have run a few more adaptation experiments with the SDXL VAE, and there are some interesting findings to share.
The experimental setups were as follows:
Through these experiments, I observed some interesting phenomena: in terms of final generated quality after the transfer, 4 > 3 ≈ 2 >> 1. I therefore speculate that both high-aesthetics pretrained models and high-aesthetics data are beneficial to VAE adaptation.
Below are some visual examples (from left to right: 4, 2, 3, 1):
Cool. Pretty interesting results. Thanks a lot for sharing.
Thank you for sharing such impressive work!
I am particularly interested in the VAE adaptation stage mentioned in the paper, which is said to have been conducted at 256×256 resolution.
I'm wondering: was this done by loading the PixArt-alpha pretrained weights from the high-aesthetics stage, and then using the 33M internal-sigma data for adaptation? Is my understanding correct?
I have tried training from the PixArt-alpha 256-SAM weights, replacing the SD1.5 VAE with the SDXL VAE and training on SAM data, but it seems difficult to converge in the short term (10k steps so far). Do you know what the problem might be?
Thank you in advance for your response.
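When swapping VAEs like this, a quick pre-training sanity check can rule out the config mismatches that cause silent quality loss. Below is a hypothetical sketch: the function name and the dict keys (`scale_factor`, `scaling_factor`, `latent_channels`) are assumptions mirroring typical diffusers-style configs, not an API from either repo:

```python
def check_vae_consistency(train_config, vae_config):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    # The training script's latent scale must match the VAE's own factor.
    if train_config["scale_factor"] != vae_config["scaling_factor"]:
        problems.append(
            f"scale_factor {train_config['scale_factor']} does not match "
            f"VAE scaling_factor {vae_config['scaling_factor']}"
        )
    # Latent channel counts must also agree, or shapes break downstream.
    if train_config["latent_channels"] != vae_config["latent_channels"]:
        problems.append("latent channel mismatch")
    return problems

# Example: an SD1.5-era training config checked against the SDXL VAE.
train_cfg = {"scale_factor": 0.18215, "latent_channels": 4}
sdxl_vae_cfg = {"scaling_factor": 0.13025, "latent_channels": 4}
issues = check_vae_consistency(train_cfg, sdxl_vae_cfg)
```

Running a check like this before launching a long adaptation run costs nothing and catches exactly the stale-scale-factor situation that can make a swapped-VAE run look like it "won't converge".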