Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Issue with get_learned_conditioning while running your Stable Diffusion Version 2 #203

Open pankaja0285 opened 1 year ago

pankaja0285 commented 1 year ago

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

So in my case the command above becomes something like:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

The issue is that I get an error on lines 342 and 345 of txt2img.py:

if opt.scale != 1.0:
    uc = model.get_learned_conditioning(batch_size * [""])  # <-- line 342
if isinstance(prompts, tuple):
    prompts = list(prompts)
c = model.get_learned_conditioning(prompts)  # <-- line 345

I wrapped the call in a try...except block inside your get_learned_conditioning function (in ddpm.py), and that is how I was able to capture the following error:

Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: c10::BFloat16 instead.
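For reference, the debugging change was roughly along these lines (a sketch of the edit inside LatentDiffusion.get_learned_conditioning in ldm/models/diffusion/ddpm.py, not the exact code; the real method has additional branches):

```python
# Sketch only: wrap the text-encoder call so the underlying exception is printed
# instead of surfacing later in the sampler.
def get_learned_conditioning(self, c):
    try:
        c = self.cond_stage_model.encode(c)
    except Exception as e:
        # On CPU this prints the attn_mask / BFloat16 dtype mismatch quoted above.
        print(f"get_learned_conditioning failed: {e}")
        raise
    return c
```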

With the above python command it defaults to the DDIM sampler. I am not sure whether it is because of the DDIM sampler that it cannot compute the learned conditioning.

I even set the sampler type explicitly and tried to skip the call to model.get_learned_conditioning(prompts), but then sample generation on line 347 fails, because both c and uc are None:

samples, _ = sampler.sample(S=opt.steps, conditioning=c, batch_size=opt.n_samples, shape=shape, verbose=False, unconditional_guidance_scale=opt.scale, unconditional_conditioning=uc, eta=opt.ddim_eta, x_T=start_code)  # <-- line 347

NOTE: I am running on CPU only.

Please take a look at this and let me know.

garasubo commented 1 year ago

I ran into a similar issue. Here is the error message when I ran:

python scripts/txt2img.py --n_samples=1 --prompt "a professional photograph of an astronaut riding a horse" --ckpt ../stable-diffusion-2-1/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
data:   0%|                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]
Sampling:   0%|                                                                                                                                                                    | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
    main(opt)
  File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 342, in main
    uc = model.get_learned_conditioning(batch_size * [""])
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 665, in get_learned_conditioning
    c = self.cond_stage_model.encode(c)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 236, in encode
    return self(text)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 213, in forward
    z = self.encode_with_transformer(tokens.to(self.device))
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 220, in encode_with_transformer
    x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 232, in text_transformer_forward
    x = r(x, attn_mask=attn_mask)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 154, in forward
    x = x + self.ls_1(self.attention(self.ln_1(x), attn_mask=attn_mask))
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 151, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1189, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/functional.py", line 5334, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and  query.dtype: c10::BFloat16 instead.

My environment:
- OS: Ubuntu 22.04
- GPU: NVIDIA GeForce RTX 2060
- CUDA: 11.8
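The last frame of the traceback can be reproduced in isolation, which suggests the problem is just the float32 attention mask meeting bfloat16 queries produced under autocast (a minimal sketch, assuming PyTorch 2.0.x as in the traceback above):

```python
import torch
import torch.nn.functional as F

# bfloat16 queries/keys/values, as autocast produces on CPU, combined with the
# float32 attn_mask that open_clip builds for its text transformer
q = k = v = torch.randn(1, 8, 77, 64, dtype=torch.bfloat16)
attn_mask = torch.zeros(77, 77)  # float32

F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
# RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, ...
```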

moefear85 commented 1 year ago

@pankaja0285 if you know it's not yours, then why are you using it? Develop your own AI on your own if you think you can do it better.

andrewivan123 commented 1 year ago

Add --device cuda:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --device cuda

garasubo commented 1 year ago

@andrewivan123 Nice, that works for me. Thank you for your kind help!

ChipsSpectre commented 1 year ago

@pankaja0285 The error is caused by trying to use the half-precision BFloat16 datatype during CPU-based inference.

Using full precision, i.e. the Float32 datatype, should fix your problem. Just add this to your command line:

--precision=full
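For anyone wondering what the flag actually changes: as far as I can tell, txt2img.py only wraps sampling in torch.autocast when precision is set to "autocast", so --precision=full keeps everything in float32. A rough sketch of the idea (not the exact script code):

```python
import torch
from contextlib import nullcontext

def precision_scope(precision: str, device: str = "cpu"):
    """Sketch: choose the context the sampling loop runs under."""
    if precision == "autocast":
        # On CPU, autocast lowers ops to bfloat16, which is what triggers
        # the attn_mask dtype mismatch above.
        return torch.autocast(device_type=device)
    # --precision=full -> no autocast, tensors stay float32
    return nullcontext()

with precision_scope("full"):
    x = torch.randn(1, 4)
    print(x.dtype)  # torch.float32
```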

mschoenebeck commented 1 year ago

@ChipsSpectre Thanks for the hint! It now throws:

$ python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt models/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --precision=full
Global seed set to 42
Loading model from models/v2-1_768-ema-pruned.ckpt
Global Step: 110000
LatentDiffusion: Running in v-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
data:   0%|                                                  | 0/1 [00:00<?, ?it/s]
Sampling:   0%|                                              | 0/3 [00:00<?, ?it/s]
Data shape for DDIM sampling is (3, 4, 96, 96), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler:   0%|                                                                                                                                                                 | 0/50 [00:00<?, ?it/s]
data:   0%|                                                                                                                                                                          | 0/1 [00:01<?, ?it/s]
Sampling:   0%|                                                                                                                                                                      | 0/3 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
    main(opt)
  File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 347, in main
    samples, _ = sampler.sample(S=opt.steps,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 104, in sample
    samples, intermediates = self.ddim_sampling(conditioning, size,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 164, in ddim_sampling
    outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 212, in p_sample_ddim
    model_uncond, model_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 1335, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 797, in forward
    h = module(h, emb, context)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 86, in forward
    x = layer(x)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

The machine I am on is a VPS with a lot of CPU cores and 64 GB of memory, but no GPU. Any ideas how to get it to run without CUDA? Any help is much appreciated.

dmille commented 1 year ago

I was able to get it running on CPU by passing the --precision full flag and changing the use_fp16 parameter in v2-inference.yaml from use_fp16: True to use_fp16: False, specifically the model.params.unet_config.params.use_fp16 key in the yaml file.

Note: I was using v2-inference.yaml, not v2-inference-v.yaml.
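If it helps, the same config change can be made programmatically. A small sketch using OmegaConf (which the repo already uses for its configs); the key path is the one named above, and the output filename is just an example:

```python
from omegaconf import OmegaConf

# Load the stock config, switch the UNet to float32 weights, and save a CPU variant.
config = OmegaConf.load("configs/stable-diffusion/v2-inference.yaml")
config.model.params.unet_config.params.use_fp16 = False
OmegaConf.save(config, "configs/stable-diffusion/v2-inference-cpu.yaml")
```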