Stability-AI / StableCascade

Official Code for Stable Cascade

Ideas for improving Stable Cascade's fine details. #31

Open Arcitec opened 8 months ago

Arcitec commented 8 months ago

I wish you lots of success with this fantastic new release! It's an incredible achievement for prompt coherence and complex hands and feet! I am stunned to see overlapping hands holding a flower, and seeing all the correct fingers. This new technology is amazing and you should be very proud. Amazingly good job by everyone involved!

Since this is a research project, I am curious. The model is very good already. But humans tend to look a bit like plastic due to the super smooth skin. Do you think that the "soft, smooth-skinned, airbrushed" look of skin will be curable via further finetuning? Or is that some kind of limitation of the small, internal latent space? Or maybe even the training data or dimensions?

I would guess that it is fixable via further refinement of the stages that add the final details (Stage B seems to be the fine details stage?).

Alternatively, users can of course add a typical "detailer" stage after the final image stage, to get crisp details. But I guess that such a tweak wouldn't be needed if Stable Cascade can be slightly revised to become better at details.
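For reference, something like the untested sketch below is what I mean by a "detailer" pass: generate with the Stable Cascade pipelines and then run a low-strength img2img refinement over the result. The diffusers classes, model IDs, and the strength value are my assumptions, not anything from this repo.

```python
# Hypothetical sketch: generate with Stable Cascade, then run a low-strength
# SDXL img2img pass as a "detailer". Model IDs and strength are assumptions.
import torch
from diffusers import (
    StableCascadePriorPipeline,
    StableCascadeDecoderPipeline,
    StableDiffusionXLImg2ImgPipeline,
)

prompt = "close-up portrait, natural skin texture"

prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prior_output = prior(prompt=prompt, num_inference_steps=20)
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    num_inference_steps=10,
).images[0]

# "Detailer" pass: light img2img with the SDXL refiner to add texture
# without changing the composition (low strength preserves structure).
detailer = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")
detailed = detailer(prompt=prompt, image=image, strength=0.25).images[0]
detailed.save("detailed.png")
```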

bghira commented 8 months ago

they are currently working on a DiT model that has a lot more effort being put into it. This is a research artifact. The new transformer model has thousands of nodes.

This model (Cascade) is bottlenecked by the VQGAN, and its reconstructive stage that reduces the latents to 4 channels. The base model (stage C) has 16 channels; that's a fair amount of information reduction.

Arcitec commented 8 months ago

> they are currently working on a DiT model that has a lot more effort being put into it. This is a research artifact. The new transformer model has thousands of nodes.

For anyone else wondering: https://huggingface.co/docs/transformers/en/model_doc/dit

Good to hear that they're making a new DiT with thousands of nodes, if I understood you correctly. That sounds like it would be great for details.

But what do you mean that Stable Cascade is "a research artifact"? The team worked with the Würstchen developers to evolve that architecture and seemed very proud on Twitter about this new network. They have also confirmed that Stable Cascade will become a commercial product when this development is done.

> This model (Cascade) is bottlenecked by the VQGAN, and its reconstructive stage that reduces the latents to 4 channels. The base model (stage C) has 16 channels; that's a fair amount of information reduction.

Hmm. Yeah, Stage C has 16 channels (I remember reading 13, but I guess that was wrong) in a 24x24 grid. That's required to compress all of the prompt information into such a small grid.

It then gets enlarged (decompressed) by Stage B. Even though that enlargement reduces the channels from 16 to 4, the latent image is much larger after decompression/enlargement, so it makes sense to reduce the information density (channels).
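Just to make that concrete for anyone reading along, here's a quick back-of-the-envelope comparison. The shapes are my assumptions for a 1024x1024 output (Stage C latent 16x24x24, Stage B latent 4x256x256, per the model card), not anything stated in this thread:

```python
# Back-of-the-envelope element counts (shapes assumed for a 1024x1024 output:
# Stage C latent 16x24x24, Stage B latent 4x256x256, per the model card).
stage_c_latent = 16 * 24 * 24      # highly compressed semantic latent from Stage C
stage_b_latent = 4 * 256 * 256     # decompressed latent that Stage A decodes
final_image    = 3 * 1024 * 1024   # RGB pixels

print(f"Stage C latent -> image: {final_image / stage_c_latent:.0f}x fewer values")   # ~341x
print(f"Stage B latent -> image: {final_image / stage_b_latent:.0f}x fewer values")   # 12x
print(f"Stage B expands Stage C: {stage_b_latent / stage_c_latent:.0f}x more values") # ~28x
```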

What do you mean that it's bottlenecked by the adversarial network btw? Do you mean that the adversarial network wasn't good enough at determining which images were detailed vs blurry?

I could believe that, because the results, while the composition is good, are all pretty blurry. Sort of like "sharp edges, blurry textures". It definitely needs some work.

I'd guess Stage B (the enlargement) is the biggest culprit for this blurred effect. Because they mention that this layer is responsible for fine details.

Do you know if Stage A (the VAE decoder) is responsible for fine details/further enlargement too? Edit: Sounds like Stage A is already capable of fine details.

pwncups commented 3 months ago

Following up on this: as usual, fine-tuning the B stages (much like madebyollin's work on Stage A, even the lite variant) produces noticeably finer detail than the released variants. I'm somewhat perplexed as to why, but it seems Stage B could have been re-trained without the CLIP/text-encoder conditioning entirely, since it effectively doesn't use it and it's a waste of VRAM.

A simple 1k steps on UHD images using a modified version of the train_b.py script, batch size of 1, AdamW (on a 4090). I'd be curious to see if anyone else can confirm similar findings.
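Roughly, the run looked like the loop below. This is only a generic sketch of that kind of short fine-tune; the loaders and encoders (load_stage_b_model, encode_targets, encode_semantic, dataloader) are hypothetical stand-ins, not the repo's actual train_b.py API, and the noising is a simplified illustration rather than the exact objective used here.

```python
# Generic sketch of the short fine-tune described above: 1k steps, batch size 1,
# AdamW. load_stage_b_model, encode_targets, encode_semantic and dataloader are
# hypothetical stand-ins, NOT the actual train_b.py API; the noising below is a
# simplified illustration, not the repo's exact training objective.
import torch
import torch.nn.functional as F
from torch.optim import AdamW

stage_b = load_stage_b_model().cuda().train()        # hypothetical loader
optimizer = AdamW(stage_b.parameters(), lr=1e-5)

for step, batch in zip(range(1_000), dataloader):    # UHD crops, batch size 1
    images = batch["image"].cuda()
    target_latents = encode_targets(images)          # Stage A latents (hypothetical)
    semantic_cond = encode_semantic(images)          # Stage C-style conditioning (hypothetical)

    # Simplified continuous-time noising: blend clean latents with Gaussian noise.
    t = torch.rand(target_latents.shape[0], 1, 1, 1, device="cuda")
    noise = torch.randn_like(target_latents)
    noised = (1 - t) * target_latents + t * noise

    # Stage B is asked to recover the clean latents from the noised input.
    pred = stage_b(noised, t.flatten(), semantic_cond)
    loss = F.mse_loss(pred, target_latents)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```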

bghira commented 3 months ago

i think it's because text-conditional guidance improves FID and CLIP scores a bit during the initial toy-model experimentation stages, just after they validate architectures using ImageNet or CIFAR etc.

you see this in the DeepFloyd architecture. the stage II model is needlessly text-conditioned and uses the T5 embeds.

the SDXL refiner makes use of only the single OpenCLIP-G model, and was only trained on the final 200 steps of the schedule! even though this kind of guidance isn't really used by the model.. the aesthetic-score conditioning does have a substantial impact, holding back the floodgates of watermarks and other poor-quality results. it's just that the prompt itself doesn't.

again, the Cascade arch passes the text guidance needlessly into stage B. but also, Würstchen v2 didn't do that. it used two separate conditionings for stage C and B, with OpenCLIP-H for stage C and OpenCLIP-G for stage B! @pcuenca do you have insight for us as to why this choice was made?
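for reference, that two-conditioning setup would look something like the sketch below with open_clip. the pretrained tags are my assumption of the usual LAION checkpoints; this is not the actual Würstchen/Cascade code.

```python
# Sketch: two separate text conditionings, as described for Wuerstchen v2
# (OpenCLIP-H for Stage C, OpenCLIP-G for Stage B). The pretrained tags are
# assumptions; this is not the actual Wuerstchen/Cascade code.
import torch
import open_clip

tok_h = open_clip.get_tokenizer("ViT-H-14")
clip_h, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)

tok_g = open_clip.get_tokenizer("ViT-bigG-14")
clip_g, _, _ = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)

prompt = ["a close-up portrait with natural skin texture"]
with torch.no_grad():
    cond_for_stage_c = clip_h.encode_text(tok_h(prompt))  # would condition Stage C
    cond_for_stage_b = clip_g.encode_text(tok_g(prompt))  # would condition Stage B

print(cond_for_stage_c.shape, cond_for_stage_b.shape)
```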