LynxPDA opened this issue 1 year ago
bf16 can roughly double CPU-mode speed, which is predictable. But how many people will benefit from this change is an open question: so far, the WebUI's support for bf16 is basically nonexistent.
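A rough way to sanity-check that claim (an illustrative sketch of my own, not a proper benchmark): time a matmul-heavy op on CPU with and without bf16 autocast.

import time
import torch

x = torch.randn(64, 1024)
lin = torch.nn.Linear(1024, 1024)

def bench(ctx):
    with ctx, torch.no_grad():
        for _ in range(10):   # warm-up
            lin(x)
        t0 = time.perf_counter()
        for _ in range(200):
            lin(x)
        return time.perf_counter() - t0

fp32 = bench(torch.autocast(device_type="cpu", enabled=False))
bf16 = bench(torch.autocast(device_type="cpu", dtype=torch.bfloat16))
print(f"fp32: {fp32:.3f}s  bf16: {bf16:.3f}s")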
I get this error when loading a random SD 2.1 model and the SD XL 1.0 base model:
version: [v1.6.0] • python: 3.11.2 • torch: 2.1.0+cpu • xformers: N/A • gradio: 3.41.2
Loading weights [e6bb9ea85b] from /pool/dev/sd-web/stable-diffusion-webui/models/Stable-diffusion/SD XL1.0/sdXL_v10VAEFix.safetensors
Creating model from config: /pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Applying attention optimization: InvokeAI... done.
changing setting sd_model_checkpoint to SD XL1.0/sdXL_v10VAEFix.safetensors [e6bb9ea85b]: RuntimeError
Traceback (most recent call last):
File "/pool/dev/sd-web/stable-diffusion-webui/modules/options.py", line 140, in set
option.onchange()
File "/pool/dev/sd-web/stable-diffusion-webui/modules/call_queue.py", line 13, in f
res = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/initialize_util.py", line 170, in <lambda>
shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 752, in reload_model_weights
load_model(checkpoint_info, already_loaded_state_dict=state_dict)
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 650, in load_model
sd_model.cond_stage_model_empty_prompt = get_empty_cond(sd_model)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 535, in get_empty_cond
d = sd_model.get_learned_conditioning([""])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models_xl.py", line 31, in get_learned_conditioning
c = self.conditioner(sdxl_conds, force_zero_embeddings=['txt'] if force_zero_negative_prompt else [])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 141, in forward
emb_out = embedder(batch[embedder.input_key])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 234, in forward
z = self.process_tokens(tokens, multipliers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 273, in process_tokens
z = self.encode_with_transformers(tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_open_clip.py", line 57, in encode_with_transformers
d = self.wrapped.encode_with_transformer(tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 470, in encode_with_transformer
x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 502, in text_transformer_forward
x = r(x, attn_mask=attn_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 242, in forward
x = q_x + self.ls_1(self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 228, in attention
return self.attn(
^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 486, in network_MultiheadAttention_forward
return originals.MultiheadAttention_forward(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 1241, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/functional.py", line 5440, in multi_head_attention_forward
attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: c10::BFloat16 instead.
After doing some searching, I guess it comes from this bug:
There is a workaround. Edit the file $(your-stable-diffusion-webui-repo-path)/venv/lib/$(your-python-version)/site-packages/open_clip/transformer.py in the open_clip library:
def attention(
        self,
        q_x: torch.Tensor,
        k_x: Optional[torch.Tensor] = None,
        v_x: Optional[torch.Tensor] = None,
        attn_mask: Optional[torch.Tensor] = None,
):
    k_x = k_x if k_x is not None else q_x
    v_x = v_x if v_x is not None else q_x
-   attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
+   if attn_mask is not None and torch.is_autocast_cpu_enabled():
+       attn_mask = attn_mask.to(torch.get_autocast_cpu_dtype())
+   else:
+       attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
    return self.attn(
        q_x, k_x, v_x, need_weights=False, attn_mask=attn_mask
    )[0]
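For context, here is a minimal standalone sketch (my own, not from the thread; it assumes a PyTorch build whose CPU scaled_dot_product_attention accepts bf16 inputs) that reproduces the dtype check the patch works around, and shows that casting the mask to the query dtype resolves it:

import torch
import torch.nn.functional as F

q = torch.randn(1, 4, 8, 16, dtype=torch.bfloat16)  # bf16 queries, as under CPU autocast
k, v = q.clone(), q.clone()
mask = torch.zeros(8, 8, dtype=torch.float32)        # mask left in fp32

try:
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
except RuntimeError as e:
    print(e)  # Expected attn_mask dtype to be bool or to match query dtype...

# Casting the mask to the query dtype, as the patched attention() does, succeeds.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask.to(q.dtype))
print(out.dtype)  # torch.bfloat16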
This fixes it. I was able to check the main functionality in practice and got a generation error only with the Stable Diffusion 2.1 model. Other effects have not been tested.
I also noticed that memory usage doubled, which is odd: shouldn't it be halved, since bf16 is half precision?
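One possible explanation (my guess, not something confirmed in this thread): autocast does not convert the model's fp32 weights in place; it creates bf16 copies for the casted ops, and with cache_enabled=True those copies are kept around, so both the fp32 originals and the bf16 copies can be resident at once. A small sketch showing that the originals stay fp32:

import torch

lin = torch.nn.Linear(1024, 1024)  # parameters are created in fp32
x = torch.randn(8, 1024)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16, cache_enabled=True):
    y = lin(x)  # the op runs in bf16 via a casted copy of the weight

print(lin.weight.dtype)  # torch.float32 -- the original weights are unchanged
print(y.dtype)           # torch.bfloat16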
I tried to reproduce it on macOS, and for me it gets stuck at 0% when I start generating an image. Why do we still use --no-half if we want half precision?
Now it started at 630s/it instead of 15s/it XD
> Why do we still use --no-half if we want half precision?
It's just a dirty hack to make sure other code keeps working.
> I tried to reproduce it on macOS, and for me it gets stuck at 0% when I start generating an image.
> Now it started at 630s/it instead of 15s/it XD
I guess it depends on your hardware support and PyTorch support.
> I guess it depends on your hardware support and PyTorch support.
Intel i7-10710U
Sadly, it looks like your hardware supports neither AVX-512 nor bfloat16.
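If you want to check what your CPU actually exposes, here is a quick Linux-only sketch (flag names as they appear in /proc/cpuinfo; bf16 autocast still runs without them, just without the hardware speedup):

# avx512f is the AVX-512 foundation flag; avx512_bf16 is the native
# bfloat16 dot-product extension referenced below.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

for isa in ("avx512f", "avx512_bf16"):
    print(isa, "supported" if isa in flags else "NOT supported")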
References:
AVX-512 BFloat16 Instructions (BF16) - x86
AVX-512 BFloat16 Instructions (AVX512_BF16) is an x86 extension, part of AVX-512, designed to accelerate neural-network-based algorithms by performing dot products on bfloat16.
Automatic Mixed Precision package
For CPU, only the lower-precision floating point datatype torch.bfloat16 is supported for now.
So I moved to OpenVINO, and now the speed has tripled (5s/it). Maybe this will be helpful for other Intel CPU/GPU users.
Is there an existing issue for this?
What would your feature do?
Many modern processors have bfloat16 support, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.
By using bfloat16 autocast, I doubled the performance.
Proposed workflow
Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) in the autocast functions, as sketched below.
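A rough sketch of what that could look like (a proof of concept only; the names here are approximate stand-ins, not the actual code of the webui's autocast helper in modules/devices.py):

import contextlib
import torch

device = torch.device("cpu")  # stand-in for the webui's devices.device

def autocast(disable=False):
    if disable:
        return contextlib.nullcontext()
    if device.type == "cpu":
        # Proposed change: bf16 autocast on CPU instead of the CUDA one.
        return torch.autocast(device_type="cpu", dtype=torch.bfloat16,
                              enabled=True, cache_enabled=True)
    return torch.autocast("cuda")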
Additional information
Other system information:
COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"
python: 3.10.6 • torch: 2.1.0.dev20230506+cpu • xformers: N/A • gradio: 3.28.1 • commit: 5ab7f213 • checkpoint: b4391b7978
OS Ubuntu 22.04
P.S. Since I'm still just a beginner programmer, the changes were made only as a proof of concept.
I was able to check the main functionality in practice and got a generation error only with the Stable Diffusion 2.1 model; the rest of the functionality worked with a 2X increase in speed.