Stability-AI / generative-models

Generative Models by Stability AI

Design question: Why don't you use v-prediction target? #108

Open bonlime opened 1 year ago

bonlime commented 1 year ago

Hi! First of all, thanks for a very good model. Stable Diffusion v2 used the v-prediction target and argued that it's better than the default epsilon prediction, so why does SDXL go back to the epsilon target for training?
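for context on the difference: under the usual DDPM forward process `x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps`, epsilon prediction regresses the noise `eps`, while v-prediction (Salimans & Ho, 2022) regresses `v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0`. a minimal sketch of the two targets (variable names are illustrative, not from this repo):

```python
import torch

def make_targets(x0, noise, alphas_cumprod, t):
    # forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    alpha_t, sigma_t = abar.sqrt(), (1.0 - abar).sqrt()
    x_t = alpha_t * x0 + sigma_t * noise

    eps_target = noise                         # epsilon prediction: regress the noise
    v_target = alpha_t * noise - sigma_t * x0  # v-prediction (Salimans & Ho, 2022)
    return x_t, eps_target, v_target
```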

JincanDeng commented 1 year ago

Same question. Does anybody know the reason?

bghira commented 1 year ago

should also ask what the results of x-prediction were (if tested), and why that isn't used.

bghira commented 1 year ago

i've got a version of SDXL with v-prediction and zero-terminal SNR :-)

[attached image: sample output from the v-prediction / zero-terminal-SNR model]
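the zero-terminal-SNR part is the beta rescaling from the ByteDance paper: shift and rescale `sqrt(abar)` so the last timestep has exactly zero SNR (which is also why it has to be paired with v-prediction - the epsilon objective becomes degenerate when the input is pure noise). a sketch of Algorithm 1 from that paper:

```python
import torch

def rescale_zero_terminal_snr(betas):
    # Lin et al., "Common Diffusion Noise Schedules and Sample Steps Are Flawed"
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    abar_sqrt = alphas_cumprod.sqrt()

    # shift so the last timestep hits zero, rescale so the first is unchanged
    abar_sqrt_0, abar_sqrt_T = abar_sqrt[0].clone(), abar_sqrt[-1].clone()
    abar_sqrt -= abar_sqrt_T
    abar_sqrt *= abar_sqrt_0 / (abar_sqrt_0 - abar_sqrt_T)

    # recover per-step alphas (and betas) from the rescaled cumulative product
    alphas_cumprod = abar_sqrt ** 2
    alphas = alphas_cumprod[1:] / alphas_cumprod[:-1]
    alphas = torch.cat([alphas_cumprod[0:1], alphas])
    return 1.0 - alphas
```

note that after rescaling the final beta is exactly 1, i.e. the last step destroys all signal, so the model genuinely learns to generate from pure noise.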

bonlime commented 1 year ago

@bghira interesting! could you provide any details on how long the fine-tuning takes? an estimate of GPU hours plus the GPU used would be sufficient. also, how does it compare to vanilla SDXL in your experiments?

bghira commented 1 year ago

on a single A100-80G it's taking an eternity. i would love to have the compute that StabilityAI offered months ago, but I've had to do it all on my own.

the contrast is much better on SDXL once you switch to v-prediction / zero-terminal SNR, but coherence suffers, presumably because of my low batch size.

currently got a test going on 8x A6000s with a 4×4×8 batch-size configuration, and it learns much more quickly, but at far higher cost.

currently at 16,000 steps; i expect about 50,000-60,000 will be needed to fully reproduce the results of the ByteDance paper that introduced this noise schedule, which is in line with the step counts they report.

we see 90 seconds per iteration, which works out to 400 GPU-hours to hit 16,000 steps, or a little over two weeks of constant training on the single A100.

is anyone from Stability AI even paying attention to this repo anymore? @mcmonkey4eva?

bonlime commented 1 year ago

@bghira woah, that's a lot of compute. it'll be interesting to see what comes out of it.

bghira commented 1 year ago

here are some more cherry-picked results. it's starting to feel like the removal of attention from the high-res layers means the model can't really learn fine details; the fine details almost end up as a grid of artifacts. this is with a timestep training bias toward the final 20% of timesteps, too.

another thing is the splotchy contrast, presumably due to the long-term use of offset noise during SDXL's initial training. that stuff is basically impossible to remove.

[five attached sample images]
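for anyone unfamiliar, offset noise adds a per-channel constant shift to the training noise so the model can learn global brightness changes; roughly like this (the 0.1 strength is the commonly cited value, not necessarily what SDXL used):

```python
import torch

def offset_noise(latents, strength=0.1):
    # standard gaussian noise plus a per-sample, per-channel constant shift;
    # the shift lets the model move overall brightness/contrast.
    noise = torch.randn_like(latents)
    b, c = latents.shape[:2]
    return noise + strength * torch.randn(
        b, c, 1, 1, device=latents.device, dtype=latents.dtype
    )
```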

bonlime commented 1 year ago

> this is with a timestep training bias toward the final 20% of timesteps

you're only training the base model on the [0.2, 1] fraction of timesteps and plan to use the vanilla refiner on top of it, right? i've also observed that by default the base model is not really good at tiny details, but it doesn't usually matter, since the refiner can improve everything.

bghira commented 1 year ago

no, there is no v-prediction refiner. i am training on all 1000 timesteps, but with a bias toward 25% of them.
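a hypothetical sketch of what such a biased timestep sampler can look like (the mixing probability and names are illustrative, not my exact scheme):

```python
import torch

def sample_timesteps(batch_size, num_timesteps=1000, final_fraction=0.25, bias_prob=0.5):
    # with probability bias_prob, draw t from the final `final_fraction`
    # of the schedule (the high-noise end); otherwise draw uniformly.
    cutoff = int(num_timesteps * (1.0 - final_fraction))
    biased = torch.randint(cutoff, num_timesteps, (batch_size,))
    uniform = torch.randint(0, num_timesteps, (batch_size,))
    pick_biased = torch.rand(batch_size) < bias_prob
    return torch.where(pick_biased, biased, uniform)
```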

bghira commented 11 months ago

just an update on this: i went ahead and made a v-prediction model from scratch using min-snr-gamma weighting. you can use it as ptx0/terminus-xl-gamma-v1, or there's a WIP checkpoint at ptx0/terminus-xl-gamma-training - that one is the latest/greatest.
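for reference, min-snr-gamma (Hang et al., 2023) clamps the per-timestep loss weight by the signal-to-noise ratio; a minimal sketch (function name and arguments are illustrative):

```python
import torch

def min_snr_gamma_weight(alphas_cumprod, t, gamma=5.0, prediction_type="v_prediction"):
    # SNR(t) = abar_t / (1 - abar_t); clamp it at gamma so low-noise
    # timesteps don't dominate the loss (Hang et al., 2023).
    abar = alphas_cumprod[t]
    snr = abar / (1.0 - abar)
    if prediction_type == "v_prediction":
        # the v-target is already scaled by 1/(SNR + 1) relative to epsilon
        return torch.clamp(snr, max=gamma) / (snr + 1.0)
    return torch.clamp(snr, max=gamma) / snr  # epsilon prediction
```

note that with zero-terminal SNR the epsilon branch divides by zero at the last timestep - another reason these tricks pair naturally with v-prediction.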

some of the more recent observations are that v-prediction works at a much lower CFG and with many fewer steps than an epsilon XL model does - much better fine details and contrast.

no reason to make epsilon models anymore - the only benefit is that training is more stable, which honestly isn't a good enough reason to use it. I trained my model on a single GPU.
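for anyone who wants to try it, a sketch of loading a v-prediction SDXL checkpoint with diffusers; the scheduler settings and sampling values here are assumptions meant to illustrate the low-CFG / few-step regime, so check the model card:

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "ptx0/terminus-xl-gamma-v1", torch_dtype=torch.float16
).to("cuda")

# a v-prediction checkpoint needs the scheduler configured to match;
# rescale_betas_zero_snr enables the zero-terminal-SNR schedule.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config,
    prediction_type="v_prediction",
    timestep_spacing="trailing",
    rescale_betas_zero_snr=True,
)

# lower CFG and fewer steps than a typical epsilon SDXL model
image = pipe("a photo of a corgi", num_inference_steps=25, guidance_scale=4.5).images[0]
image.save("corgi.png")
```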

bonlime commented 11 months ago

@bghira just to clarify - is your experience that it's better to train from scratch, rather than trying to fine-tune with a new prediction target?

do you think it would be possible to train a v-prediction version for Latent Consistency Models (LCM) as well? not asking you to do it, just theoretically - do you envision any problems with that?

bghira commented 10 months ago

terminus-xl-gamma-v2 is now released, with a major improvement in quality.