AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

Open-clip upgrade #551

Closed Ehplodor closed 1 year ago

Ehplodor commented 1 year ago

Is your feature request related to a problem? Please describe. No

Describe the solution you'd like Upgrading to the latest open-clip large model released today?

Describe alternatives you've considered No alternative

Additional context https://laion.ai/blog/large-openclip/ https://twitter.com/EMostaque/status/1570501470751174656?s=20&t=wSi9ttdQoH141WM6lZ1Xmw

artificialguybr commented 1 year ago

I want this too

Arcitec commented 1 year ago

This is stunning. So H/14 was released on Sept 14th and is a massive upgrade in its ability to comprehend text prompts and generate complex images.

I wonder what type of GPU can load the model. Even a 3090 with 24 GB of VRAM may not be enough?

Arcitec commented 1 year ago

Oh and they say that H/14 represents the completion of replicating OpenAI's paper as fully open source:

Producing the best open source CLIP model out of this data set completes the open source replication of the excellent CLIP paper that OpenAI released one year ago.

So if I understood correctly, they're saying that these are the most advanced algorithms and neural networks that OpenAI's paper described.

Arcitec commented 1 year ago

Sounds like those models are not directly related to Stable Diffusion, but are much better than Stable Diffusion:

https://the-decoder.com/new-clip-model-aims-to-make-stable-diffusion-even-better/

trufty commented 1 year ago

Yeah, you can't just drop in H-14; SD would need to be retrained from scratch.
However, CLIP-guided diffusion could be implemented, which is a somewhat less effective (and slower) alternative.

https://github.com/huggingface/diffusers/commit/dc2a1c1d07bef046a76491ee5d4aab61ecfd67bc
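
For anyone wanting to try it, here's roughly what using that diffusers work looks like. The community pipeline name and call arguments below are my reading of the linked change, so treat this as a sketch rather than the exact API:

```python
import torch
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel

clip_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"  # OpenCLIP H/14 weights on the HF hub
feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_id)
clip_model = CLIPModel.from_pretrained(clip_id, torch_dtype=torch.float16)

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="clip_guided_stable_diffusion",  # community pipeline from the linked work
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a painting of a fox in a snowy forest",
    num_inference_steps=50,
    clip_guidance_scale=100,  # strength of the per-step CLIP nudges (assumed kwarg)
).images[0]
image.save("clip_guided_fox.png")
```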

Arcitec commented 1 year ago

@trufty Interesting.

  1. What is CLIP-Guided Diffusion? Is that where CLIP is used as an output classifier until Stable Diffusion has generated some image output that H-14's CLIP classifies as matching the prompt? Since the H-14 CLIP is much better at classifying objects, it should lead to better output.

If I understood that correctly, that sounds like a good improvement.

  2. Is there some scenario where OpenCLIP H-14 can be used on its own (without SD then, I guess) as the image generator... to get all the accuracy and quality benefits that OpenCLIP's article talks about above? From what I read, OpenCLIP is pretty much on par with DALL-E 2, and is now the best open-source AI art generator available.

trufty commented 1 year ago

CLIP-Guided Diffusion basically runs CLIP against every step of the generation to push the next step more toward the desired direction. It should lead to better output, but I haven't actually tried it myself.

For point 2: probably not. It excels at matching text embeddings to image embeddings, but the diffusion UNet does all the image-generation heavy lifting.
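
In pseudo-diffusers terms, the per-step guidance idea is something like this. This is just a conceptual sketch; the names follow diffusers conventions, the gradient scaling is simplified, and `clip_loss_fn` is a stand-in for decoding the latents and scoring them with CLIP:

```python
import torch

def clip_guided_step(unet, scheduler, latents, t, text_emb, clip_loss_fn, clip_scale=100.0):
    # Normal noise prediction for this denoising step.
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample

    # clip_loss_fn is assumed to decode the latents and return
    # (1 - CLIP similarity) between the decoded image and the prompt.
    latents = latents.detach().requires_grad_(True)
    grad = torch.autograd.grad(clip_loss_fn(latents), latents)[0]

    # Adding the loss gradient to the noise prediction steers the next
    # sample toward images that CLIP scores as closer to the prompt.
    noise_pred = noise_pred + clip_scale * grad
    return scheduler.step(noise_pred, t, latents).prev_sample
```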

Arcitec commented 1 year ago

CLIP-Guided Diffusion basically runs CLIP against every step of the generation to push the next step more toward the desired direction. It should lead to better output, but I haven't actually tried it myself.

Sounds good, but it seems like it would be very slow, since classification already runs slowly whenever I upload an image and "Interrogate" it in img2img on an RTX 3090.

For point 2: probably not. It excels at matching text embeddings to image embeddings, but the diffusion UNet does all the image-generation heavy lifting.

Oh okay. I misunderstood the article then. So OpenCLIP is really just the image classifier portion.

This part of the article actually seems to put it well:

"In the generative AI models for images created after DALL-E 1, CLIP often takes a central role, for example in CLIP+VQGAN, CLIP-guided diffusion, or StyleGAN-NADA. In these examples, CLIP computes the difference between an input text and an image generated by, say, a GAN. The difference is minimized by the model to produce a better image.

In contrast, in newer models, such as DALL-E 2 or Stable Diffusion, CLIP encoders are directly integrated into the AI model and their embeddings are processed by the diffusion models used."

So it's saying that CLIP will be useful for checking the quality of generated images, or for classifying images. So it seems like OpenCLIP won't be directly used for making images.
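
If anyone wants to play with that "quality check" idea, scoring a generated image against a prompt with open_clip only takes a few lines. The model/pretrained tags below are the ones from the LAION announcement, and the image path is just a placeholder:

```python
import torch
import open_clip
from PIL import Image

# Build the new H/14 model with its LAION-2B weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)

image = preprocess(Image.open("generated.png")).unsqueeze(0)  # placeholder output image
text = open_clip.tokenize(["a painting of a fox in a snowy forest"])

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    print("CLIP similarity:", (img_f @ txt_f.T).item())  # higher = better match to the prompt
```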

Interestingly, it mentions that Stable Diffusion has a CLIP text encoder built into the network itself. So I suspect the advances from OpenCLIP H/14 (a new network architecture and training data, I guess), which were funded by Stability AI, could later be integrated into a future version of Stable Diffusion. But for now, it seems like we won't get any of the OpenCLIP H/14 advances in Stable Diffusion; that would require a new Stable Diffusion network and a new model.

I am sure that Stability AI will integrate anything useful into newer models as they continue evolving Stable Diffusion! :)

Arcitec commented 1 year ago

Found some official news from Stability AI from Aug 23:

https://twitter.com/EMostaque/status/1562146080820715521

...with the clip Vit-h we will release soon and the UL2 embedded image model compositionality goes waaay up at the expense of speed and accessibility

So yes they are planning to add OpenCLIP H to the Stable Diffusion network design. I am sure their new model is being trained as we speak.

I am also sure it will need a lot of VRAM. Maybe even more than 24GB (would suck, but they did mention that the new model will not be usable by most people).

Edit: Found confirmation that they are indeed adding OpenCLIP Vit-H to Stable Diffusion. This was a question about why they use OpenCLIP. The answer confirms it and that the H model is coming soon:

https://twitter.com/EMostaque/status/1558860841017118721

CLIP Vit-H finishes up the OpenCLIP model set and will be useful for research as well as real world applications.

0xdevalias commented 1 year ago

It would also be cool to be able to use the new OpenClip models in the 'Interrogate CLIP' feature. Here's a related issue on another repo:

And my exploration + hacky implementation for that repo to use it:

tl;dr: It was pretty trivial to swap from the original clip library to the OpenClip library, which made it simple to switch the model used for CLIP interrogation over to the new OpenClip models.
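
For illustration, the swap is basically this (model names here are examples, not necessarily the exact ones I used):

```python
# Before: OpenAI's clip package
# import clip
# model, preprocess = clip.load("ViT-L/14", device="cuda")

# After: open_clip, which can also load the new LAION-trained checkpoints
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device="cuda"
)
tokenize = open_clip.tokenize  # stands in for clip.tokenize
```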

Ehplodor commented 1 year ago

Hi @0xdevalias, what do you think about the possibility of swapping the "old" CLIP model's weights for the newest ones in SD v1.4/1.5? Is it really impossible/irrelevant, as stated earlier? Or is SD some kind of plug-and-play architecture where each component can be modified at will?

0xdevalias commented 1 year ago

@Ehplodor Unfortunately that's well beyond my current knowledge/understanding of things; though when I was searching for a similar question the main suggestion seemed to be that SD would need to be re-trained using the new CLIP models to see the biggest benefit.
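
One concrete reason a straight weight swap can't work, as far as I understand it: SD v1.x's cross-attention is sized for the 768-wide ViT-L/14 text features, while ViT-H/14's text tower is 1024-wide, so the tensors simply don't line up without retraining. A quick sketch to see the widths (no pretrained weights needed, just the architectures):

```python
import open_clip

# Build the two architectures without downloading weights, just to compare text widths.
vit_l = open_clip.create_model_and_transforms("ViT-L-14")[0]
vit_h = open_clip.create_model_and_transforms("ViT-H-14")[0]

print(vit_l.token_embedding.weight.shape[-1])  # 768  -> what SD v1.x cross-attention expects
print(vit_h.token_embedding.weight.shape[-1])  # 1024 -> ViT-H/14 text width
```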

The best info I could find about how to potentially 'enhance' the current SD models was by using CLIP guidance from the new OpenClip models, though that was apparently much slower, and can have mixed results on quality.

I can't say with 100% certainty that that's the full picture, though, as I don't have the knowledge / haven't dug deep enough into things to explore whether anything else is possible.

Ehplodor commented 1 year ago

Stable Diffusion v2 is out with the new OpenCLIP: https://github.com/Stability-AI/stablediffusion
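
If you're curious, a quick way to confirm the new text encoder from diffusers (this pulls the public stabilityai/stable-diffusion-2 weights on first run):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2")
print(type(pipe.text_encoder).__name__)      # CLIPTextModel, converted from OpenCLIP ViT-H/14
print(pipe.text_encoder.config.hidden_size)  # 1024, vs. 768 for SD v1.x
```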