Open bonlime opened 1 year ago
Same question. Somebody knows the reason?
Should also ask what the results of x-prediction were (if tested), and why that isn't used.
i've got a version of SDXL with v-prediction and zero-terminal SNR :-)
@bghira interesting! Could you provide any details on how long the fine-tuning takes? An estimate of GPU hours plus the GPU used would be sufficient. Also, how does it compare to vanilla SDXL in your experiments?
on a single A100-80G it's taking an eternity. would love to have the compute that was offered by StabilityAI months ago but I've had to do it all on my own.
the contrast is much better on SDXL once you switch to v-pred / zero-terminal SNR. but coherence suffers, presumably because of my low batch size.
currently got a test going on 8x A6000 with a 4*4*8 batch size configuration, and it learns much more quickly, but at far higher cost.
currently at 16,000 steps, and I expect about 50,000-60,000 will be needed to fully reproduce the results of the Bytedance paper that introduced this noise schedule; that step count is consistent with what they report.
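For anyone wanting to try the same thing: the zero-terminal-SNR fix from that Bytedance paper ("Common Diffusion Noise Schedules and Sample Steps are Flawed", Lin et al. 2023) is just a rescale of the beta schedule so the last timestep is pure noise. A sketch of their Algorithm 1 in torch:

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the terminal timestep has zero SNR.

    Sketch of Algorithm 1 from Lin et al. 2023; shifts and scales
    sqrt(alpha_bar) so the last value is exactly 0 while the first
    value is preserved, then converts back to betas.
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    # Remember the original endpoints.
    s0 = alphas_bar_sqrt[0].clone()
    sT = alphas_bar_sqrt[-1].clone()

    # Shift so the terminal value becomes zero, then rescale so the
    # first value stays where it was.
    alphas_bar_sqrt = alphas_bar_sqrt - sT
    alphas_bar_sqrt = alphas_bar_sqrt * s0 / (s0 - sT)

    # Convert the rescaled cumulative product back into betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[0:1], alphas])
    return 1.0 - alphas
```

With this applied, the final beta becomes 1.0, so the model is actually trained on pure noise at t=T instead of a leaked low-frequency signal.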
we see about 90 seconds per iteration, so 16,000 steps works out to roughly 400 hours, or a little over two weeks of constant training.
is anyone from Stability AI even paying attention to this repo anymore? @mcmonkey4eva ?
@bghira woah, that's a lot of compute, interesting to see what would come out of it
here's some more cherry-picked results. It's starting to feel like the removal of attention from the high-res layers means the model can't really learn fine details; the fine details almost end up as a grid of artifacts. This is with a timestep training bias toward the final 20% of timesteps, too.
another thing is the splotchy contrast, presumably due to the long term use of offset noise during SDXL's initial training. that stuff is basically impossible to remove.
> this is with a timestep training bias toward the final 20% of timesteps

You're only training the base model on the [0.2, 1] fraction of timesteps, and plan to use the vanilla refiner on top of it, right? I've also observed that by default the base model is not really good at tiny details, but it doesn't usually matter, since the refiner can improve everything.
no, there is no v-prediction refiner. I am training on all 1000 timesteps, but with a bias toward the final 25% of them.
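To make the "bias toward the final 25%" concrete: one simple way to do it is to draw part of each batch's timesteps from the high-noise tail and the rest uniformly. This is a hypothetical sketch (the function name and split parameters are mine, not from any repo mentioned here):

```python
import torch

def sample_biased_timesteps(batch_size: int,
                            num_train_timesteps: int = 1000,
                            bias_fraction: float = 0.25,
                            bias_portion: float = 0.5) -> torch.Tensor:
    """Sample training timesteps with extra mass on the final
    `bias_fraction` of the schedule (the highest-noise steps).

    `bias_portion` of the batch comes from the final window, the
    remainder is drawn uniformly over the whole range.
    """
    n_biased = int(batch_size * bias_portion)
    cutoff = int(num_train_timesteps * (1.0 - bias_fraction))
    # Timesteps from the final (high-noise) window.
    biased = torch.randint(cutoff, num_train_timesteps, (n_biased,))
    # Timesteps from the full range.
    uniform = torch.randint(0, num_train_timesteps, (batch_size - n_biased,))
    return torch.cat([biased, uniform])
```

The idea is that every timestep still gets gradient signal, but the composition-defining high-noise steps are visited more often.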
just an update on this, i personally went ahead and made a v-prediction model from scratch using min-snr-gamma. you can use it as ptx0/terminus-xl-gamma-v1
or a WIP checkpoint at ptx0/terminus-xl-gamma-training - this one is the latest/greatest.
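For anyone unfamiliar with min-snr-gamma (Hang et al. 2023): it caps the per-timestep SNR-derived loss weight at gamma so low-noise timesteps don't dominate training. A minimal sketch, assuming the usual epsilon and v-prediction variants:

```python
import torch

def min_snr_gamma_weight(alphas_cumprod: torch.Tensor,
                         timesteps: torch.Tensor,
                         gamma: float = 5.0,
                         prediction_type: str = "v_prediction") -> torch.Tensor:
    """Min-SNR-gamma loss weights (Hang et al. 2023).

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); the weight clamps SNR
    at gamma. The v-prediction variant divides by SNR + 1 instead of
    SNR, since the v target already has an implicit SNR weighting.
    """
    snr = alphas_cumprod[timesteps] / (1.0 - alphas_cumprod[timesteps])
    clipped = torch.clamp(snr, max=gamma)
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    return clipped / snr
```

Multiply the per-sample MSE by this weight before reducing to the batch loss.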
some of the more recent observations are that v-prediction works at a much lower CFG and with many fewer steps than an epsilon XL model does. much better fine details and contrast.
no reason to make epsilon models anymore - the only benefit is that training is more stable, which is honestly not a good enough reason to use it. I trained my model on a single GPU.
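For context on what switching the target actually means: with v-prediction (Salimans & Ho, 2022) the network regresses v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0 instead of the raw noise eps, which keeps the target well-scaled at every timestep. A minimal sketch of the target computation:

```python
import torch

def get_v_target(x0: torch.Tensor,
                 noise: torch.Tensor,
                 alphas_cumprod: torch.Tensor,
                 timesteps: torch.Tensor) -> torch.Tensor:
    """v-prediction target: v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    a = alphas_cumprod[timesteps].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[timesteps]).sqrt().view(-1, 1, 1, 1)
    return a * noise - s * x0
```

At low noise (alpha_bar near 1) the target is mostly the noise; at high noise it is mostly the (negated) clean latent, so the model is never asked to predict something vanishingly small.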
@bghira just to clarify - your experience is that it's better to train from scratch, rather than trying to fine-tune with new prediction target?
do you think it would be possible to train a v-prediction version for Consistency Models as well (LCM)? Not by you, just theoretically do you envision any problems with that?
terminus-xl-gamma-v2 is released now with major improvement in quality.
Hi! First of all, thanks for a very good model. Stable Diffusion v2 used the `v-prediction` target and argued that it's better than the default `epsilon` prediction, so why do you use the `epsilon` target for SDXL training again?