AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Feature Request]: Add support for autocast bfloat16 for generate on the latest CPUs #10516

Open · LynxPDA opened this issue 1 year ago

LynxPDA commented 1 year ago


What would your feature do?

Many modern processors support bfloat16, such as AMD Zen 4, Apple M2, Intel Cooper Lake, and Intel Sapphire Rapids.

By using autocast with bfloat16 I doubled the performance.

Proposed workflow

  1. Change in ./modules/devices.py

Add return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True) in the autocast function.

def autocast(disable=False):
    from modules import shared

    if disable:
        return contextlib.nullcontext()

    if dtype == torch.float32 or shared.cmd_opts.precision == "full":
        # run CPU ops under bfloat16 autocast instead of contextlib.nullcontext()
        return torch.autocast(enabled=True, dtype=torch.bfloat16, device_type='cpu', cache_enabled=True)

    return torch.autocast("cuda")
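For reference, a minimal standalone sketch (not part of the WebUI code; tensor shapes are arbitrary) of what this autocast context does on CPU:

import torch

a = torch.randn(64, 64)  # float32 inputs
b = torch.randn(64, 64)

# Inside the context, autocast-eligible ops such as matmul run in bfloat16.
with torch.autocast(device_type='cpu', dtype=torch.bfloat16, cache_enabled=True):
    c = a @ b
    print(c.dtype)    # torch.bfloat16

print((a @ b).dtype)  # torch.float32 outside the context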
  2. Change in ./modules/sd_samplers_common.py

Add if x_sample.dtype == torch.bfloat16: x_sample = x_sample.to(torch.float16) in single_sample_to_image, because NumPy does not support bfloat16 yet.
def single_sample_to_image(sample, approximation=None):
    if approximation is None:
        approximation = approximation_indexes.get(opts.show_progress_type, 0)

    if approximation == 2:
        x_sample = sd_vae_approx.cheap_approximation(sample)
    elif approximation == 1:
        x_sample = sd_vae_approx.model()(sample.to(devices.device, devices.dtype).unsqueeze(0))[0].detach()
    else:
        x_sample = processing.decode_first_stage(shared.sd_model, sample.unsqueeze(0))[0]

    x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
    if x_sample.dtype == torch.bfloat16:
        x_sample = x_sample.to(torch.float16)
    x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
    x_sample = x_sample.astype(np.uint8)
    return Image.fromarray(x_sample)
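To illustrate why the extra cast is needed, a minimal sketch independent of the WebUI code: calling .numpy() on a bfloat16 tensor fails because NumPy has no bfloat16 dtype.

import torch

x = torch.rand(4, dtype=torch.bfloat16)
# x.cpu().numpy() would raise an error here, since NumPy has no bfloat16 dtype;
# cast to a NumPy-supported dtype (float16 or float32) first.
x_np = x.to(torch.float16).cpu().numpy()
print(x_np.dtype)  # float16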

Additional information

Other system information:

COMMANDLINE_ARGS="--precision autocast --use-cpu all --no-half --opt-channelslast --skip-torch-cuda-test --enable-insecure-extension-access"

python: 3.10.6  •  torch: 2.1.0.dev20230506+cpu  •  xformers: N/A  •  gradio: 3.28.1  •  commit: 5ab7f213  •  checkpoint: b4391b7978

OS Ubuntu 22.04

P.S. Since I'm still just a beginner programmer, these changes were made only as a proof of concept.

I was able to test the main functionality in practice and only got a generation error with the Stable Diffusion 2.1 model; the rest of the functionality worked at a 2x increase in speed.

Sakura-Luna commented 1 year ago

bf16 roughly doubling speed in CPU mode is predictable. But how many people will benefit from this change is an open question. So far, the WebUI's support for bf16 is basically nonexistent.

CatEricka commented 11 months ago

I get this error when loading a random SD 2.1 model and the SD XL 1.0 base model:

version: v1.6.0  •  python: 3.11.2  •  torch: 2.1.0+cpu  •  xformers: N/A  •  gradio: 3.41.2

Loading weights [e6bb9ea85b] from /pool/dev/sd-web/stable-diffusion-webui/models/Stable-diffusion/SD XL1.0/sdXL_v10VAEFix.safetensors
Creating model from config: /pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/configs/inference/sd_xl_base.yaml
Applying attention optimization: InvokeAI... done.
changing setting sd_model_checkpoint to SD XL1.0/sdXL_v10VAEFix.safetensors [e6bb9ea85b]: RuntimeError
Traceback (most recent call last):
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/options.py", line 140, in set
    option.onchange()
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/call_queue.py", line 13, in f
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/initialize_util.py", line 170, in <lambda>
    shared.opts.onchange("sd_model_checkpoint", wrap_queued_call(lambda: sd_models.reload_model_weights()), call=False)
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 752, in reload_model_weights
    load_model(checkpoint_info, already_loaded_state_dict=state_dict)
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 650, in load_model
    sd_model.cond_stage_model_empty_prompt = get_empty_cond(sd_model)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models.py", line 535, in get_empty_cond
    d = sd_model.get_learned_conditioning([""])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_models_xl.py", line 31, in get_learned_conditioning
    c = self.conditioner(sdxl_conds, force_zero_embeddings=['txt'] if force_zero_negative_prompt else [])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 141, in forward
    emb_out = embedder(batch[embedder.input_key])
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 234, in forward
    z = self.process_tokens(tokens, multipliers)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_clip.py", line 273, in process_tokens
    z = self.encode_with_transformers(tokens)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/modules/sd_hijack_open_clip.py", line 57, in encode_with_transformers
    d = self.wrapped.encode_with_transformer(tokens)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 470, in encode_with_transformer
    x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/repositories/generative-models/sgm/modules/encoders/modules.py", line 502, in text_transformer_forward
    x = r(x, attn_mask=attn_mask)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 242, in forward
    x = q_x + self.ls_1(self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/open_clip/transformer.py", line 228, in attention
    return self.attn(
           ^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 486, in network_MultiheadAttention_forward
    return originals.MultiheadAttention_forward(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 1241, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pool/dev/sd-web/stable-diffusion-webui/venv_cpu/lib/python3.11/site-packages/torch/nn/functional.py", line 5440, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and  query.dtype: c10::BFloat16 instead.

After some searching, I guess it comes from this bug:

https://github.com/pytorch/pytorch/issues/99012
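For what it's worth, the failure mode can be sketched outside the WebUI; whether this actually raises depends on the PyTorch build, and the shapes here are arbitrary:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)
mask = torch.zeros(4, 4)  # float32 attn_mask, like the one open_clip builds

with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    # On affected builds (e.g. torch 2.1 CPU) this fails with
    # "Expected attn_mask dtype to be bool or to match query dtype ...",
    # because the inputs are autocast to bfloat16 but the mask stays float32.
    out, _ = mha(x, x, x, attn_mask=mask)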

CatEricka commented 11 months ago

There is a workaround. Edit the file $(your-stable-diffusion-webui-repo-path)/venv/lib/$(your-python-version)/site-packages/open_clip/transformer.py in the open_clip library:

    def attention(
            self,
            q_x: torch.Tensor,
            k_x: Optional[torch.Tensor] = None,
            v_x: Optional[torch.Tensor] = None,
            attn_mask: Optional[torch.Tensor] = None,
    ):
        k_x = k_x if k_x is not None else q_x
        v_x = v_x if v_x is not None else q_x

-        attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
+        if torch.is_autocast_cpu_enabled():
+            attn_mask = attn_mask.to(torch.get_autocast_cpu_dtype()) if attn_mask is not None else None
+        else:
+            attn_mask = attn_mask.to(q_x.dtype) if attn_mask is not None else None
        return self.attn(
            q_x, k_x, v_x, need_weights=False, attn_mask=attn_mask
        )[0]
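The two helpers used in the patch can be checked in isolation; a minimal sketch, using the PyTorch version from this thread:

import torch

with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    print(torch.is_autocast_cpu_enabled())  # True inside the context
    print(torch.get_autocast_cpu_dtype())   # torch.bfloat16

print(torch.is_autocast_cpu_enabled())      # False outside the context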

This fixes the problem described earlier:

I was able to test the main functionality in practice and only got a generation error with the Stable Diffusion 2.1 model

Other effects have not been tested.


I also noticed that memory usage doubled, which is odd; shouldn't it be halved, since bfloat16 is half precision?

sebaxakerhtc commented 8 months ago

I tried to reproduce it on macOS and for me it gets stuck at 0% when I start generating an image. Why do we still use --no-half if we want half precision?

Now it started with 630s/it instead of 15s/it XD

CatEricka commented 8 months ago

Why do we still use --no-half if we want half precision?

It's just a dirty hack to make sure other code keeps working.

I tried to reproduce it on macOS and for me it gets stuck at 0% when I start generating an image.

Now it started with 630s/it instead of 15s/it XD

I guess it depends on your hardware support and pytorch support.

sebaxakerhtc commented 8 months ago

I guess it depends on your hardware support and pytorch support.

Intel i7-10710U

CatEricka commented 8 months ago

I guess it depends on your hardware support and pytorch support.

Intel i7-10710U

Sadly, it looks like your hardware doesn't support AVX-512 and bfloat16.

References:

AVX-512 BFloat16 Instructions (BF16) - x86

AVX-512 BFloat16 Instructions (AVX512_BF16) is an x86 extension, part of AVX-512, designed to accelerate neural network-based algorithms by performing dot-product on bfloat16.

Automatic Mixed Precision package

For CPU, only lower precision floating point datatype of torch.bfloat16 is supported for now.
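If you want to check whether a CPU advertises the AVX512_BF16 extension, on Linux the kernel flag can be read directly (a minimal sketch; PyTorch can still run bfloat16 ops without it, just much more slowly):

# Check /proc/cpuinfo for the avx512_bf16 flag (Linux only).
with open('/proc/cpuinfo') as f:
    has_bf16 = 'avx512_bf16' in f.read()
print('AVX512_BF16 supported:', has_bf16)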

sebaxakerhtc commented 8 months ago

So I moved to OpenVINO and now the speed has tripled (5s/it). Maybe this will be helpful for other Intel CPU/GPU users.