AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Feature Request]: Add pytorch-directml Support to Boost The Efficiency on ANY Windows 10 with DirectX 12 #3756

Open SheepChef opened 1 year ago

SheepChef commented 1 year ago

Is there an existing issue for this?

What would your feature do?

Microsoft has published a new torch package called pytorch-directml, which adds a new device called "dml" to PyTorch and is able to run on any GPU (including AMD) on Windows 10 with DirectX 12.

I've noticed that DirectML has recently added support for Python 3.10 and the newest version of PyTorch; however, there are still many problems ahead.

I am wondering if this feature can be added. If so, efficiency would be much improved on Windows devices that only have AMD GPUs, and on Linux devices that the AMD ROCm driver doesn't support.

For instance, I compared the speed of CPU-only, CUDA, and DirectML for 512x512 image generation with 20 steps:

CPU-only: around 6~9 minutes.
CUDA: within 10 seconds.
DirectML: within 10~30 seconds.

Thus DirectML is at least an order of magnitude faster than CPU-only.

For a pytorch-directml reference, see pytorch-with-directml.
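
For context, here is a minimal sketch of what using torch-directml looks like outside the webui, assuming the package is installed on top of a CPU build of PyTorch (recent builds expose the device through torch_directml.device() rather than the plain "dml" device string used by the older fork):

import torch
import torch_directml

dml = torch_directml.device()          # DirectX 12 device: works on AMD, Intel and NVIDIA GPUs
a = torch.randn(512, 512).to(dml)
b = torch.randn(512, 512).to(dml)
c = (a @ b).cpu()                      # the matmul runs on the GPU via DirectML
print(c.shape)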

Proposed workflow

The feature is a fundamental change to the project. Once the dependency problems are solved, what remains is changing the device from "cuda" to "dml".

parser.add_argument('--device', type=str, default='dml', help='The device to use for training.')

so the steps are:

  1. Solve the dependency problems.
  2. Change the device used in the code (a rough sketch follows below).
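
A rough sketch of what step 2 could look like, assuming the proposed --device flag. The resolve_device helper is illustrative and not actual webui code; the torch_directml import follows the newer plugin's API rather than the old "dml" device string.

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--device', type=str, default='dml', help='The device to use.')
args = parser.parse_args()

def resolve_device(name: str) -> torch.device:
    if name == 'dml':
        import torch_directml          # newer plugin: the device handle comes from the plugin
        return torch_directml.device()
    return torch.device(name)          # 'cuda', 'cpu', 'mps', ...

device = resolve_device(args.device)
x = torch.randn(1, 4, 64, 64).to(device)   # tensors and models are moved with .to(device), as with CUDA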

Additional information

If this feature is added, the GPU hardware requirements for the project will be much lower. Please consider it!

SheepChef commented 1 year ago

According to my brief investigation, there are a few dependencies that aren't compatible with torch 1.8.0:

  1. PyTorch Lightning (solved by downgrading)
  2. kornia (solved by downgrading)
  3. xformers (all releases require torch >= 1.12.0, so it must be replaced with a substitute)

However, downgrading is not a final solution, since it is very likely that the project uses functions that the old dependency versions don't have.

Merramore commented 1 year ago

As of Dec 6, pytorch-directml is now v1.13 using an out-of-tree backend, which should make win10+dml possible with a bit of luck.

Edit: It seems some operations are still not yet implemented (NYI).

Small-Ku commented 1 year ago

I think modifying get_optimal_device like this would get it working.

def get_optimal_device():
    if torch.cuda.is_available():
        return torch.device(get_cuda_device_string())

    # fall back to DirectML if the torch-directml plugin is importable
    try:
        import torch_directml
        return torch_directml.device()
    except Exception:
        pass

    if has_mps():
        return torch.device("mps")

    return cpu
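
For anyone trying this, a quick sanity check outside the webui (assuming a recent torch-directml build that provides these helper functions):

import torch
import torch_directml

# List the DirectML adapters and run one op on the default device.
print(torch_directml.device_count(), torch_directml.device_name(0))
x = torch.ones(2, 2, device=torch_directml.device())
print((x + x).cpu())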

I get an error like https://github.com/microsoft/DirectML/issues/368 when trying to use it:

Error completing request
Arguments: ('task(bkmy03sy59eptz3)', '1boy', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, 0, False, 'x264', 'mci', 10, 0, False, True, True, True, 'intermediate', 'animation', False, False, False, False, '', 1, '', 0, '', 0, '', True, False, False, False) {}
Traceback (most recent call last):
  File "D:\dev\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "D:\dev\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "D:\dev\stable-diffusion-webui\modules\txt2img.py", line 52, in txt2img
    processed = process_images(p)
  File "D:\dev\stable-diffusion-webui\modules\processing.py", line 485, in process_images
    res = process_images_inner(p)
  File "D:\dev\stable-diffusion-webui\modules\processing.py", line 627, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "D:\dev\stable-diffusion-webui\modules\processing.py", line 822, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 544, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling
    return func()
  File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 544, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 337, in forward
    x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
  File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 72, in sigma_to_t
    low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(max=self.log_sigmas.shape[0] - 2)
RuntimeError

It seems that using DirectML would need many workarounds to make it work in its current state.

RysiekMC commented 1 year ago

First of all, sorry for my bad English - I'm from Poland. I found a workaround that worked for me (Intel UHD 620).

In external.py, replace the sigma_to_t function with this code:

def sigma_to_t(self, sigma, quantize=None):
    quantize = self.quantize if quantize is None else quantize
    dists = torch.abs(sigma - self.sigmas[:, None])
    if quantize:
        return torch.argmin(dists, dim=0).view(sigma.shape)
    low_idx, high_idx = torch.sort(torch.topk(dists, dim=0, k=2, largest=False).indices, dim=0)[0]
    low, high = self.sigmas[low_idx], self.sigmas[high_idx]
    w = (low - sigma) / (low - high)
    w = w.clamp(0, 1)
    t = (1 - w) * low_idx + w * high_idx
    return t.view(sigma.shape)

And here is part of my devices.py (not strictly necessary):

import torch_directml

dml = torch_directml.device()

def get_optimal_device():
    if torch_directml.is_available():
        return torch.device(dml)
    if has_mps():
        return torch.device("mps")
    return cpu

I found that DirectML support is very broken right now: after image generation the memory is not freed, and I have to restart the webui (so this is only a temporary solution).

Tell me if it helped

EDIT: This is my first post on GitHub. I'm attaching the files in case they're needed (change the extension from .txt to .py):

external.py: external.txt
devices.py: devices.txt

Kazaflow commented 1 year ago

RysiekMC's changes work, kind of. I got it to run on my RX 5700 XT, but it runs out of VRAM very quickly and, as mentioned, doesn't free the VRAM afterwards; still, the speed is worth it over running on CPU only. The VRAM issue could be related to this error I get on initialization: "Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled"

RysiekMC commented 1 year ago

This error appears because torch-directml uses a PyTorch build compiled for CPU only. Another problem is that after generating a single image I have to restart the webui server to free up graphics memory. I think this is because the webui currently targets CUDA only, so it requires more modifications, not only to the webui itself but also to the other repositories it uses.

Kazaflow commented 1 year ago

I noticed something interesting: passing --no-half --no-half-vae --medvram removes the need to restart the UI for me, though it still uses all of my 8 GB of VRAM plus a few GB of RAM. But some samplers don't work because of "Script RuntimeError: Device type PRIVATEUSEONE is not supported for torch.Generator() api.", which again is a DirectML problem.
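
For reference, the usual workaround for that torch.Generator() limitation is to seed and draw the noise on the CPU and then move it to the DirectML device; a minimal sketch (illustrative, not the webui's actual code):

import torch
import torch_directml

dml = torch_directml.device()

# torch.Generator reportedly cannot be created for the DirectML ("PrivateUse1") device,
# so create the generator and sample on the CPU, then move the result to the GPU.
gen = torch.Generator(device="cpu").manual_seed(1234)
noise = torch.randn(1, 4, 64, 64, generator=gen, device="cpu").to(dml)
print(noise.device)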

GoForceX commented 1 year ago

It works, but not entirely, on my RX 6700 XT. --no-half --no-half-vae --medvram works, but VRAM runs out quickly. Using --lowvram is fine without restarting, but generation takes a lot longer. Not using these flags results in an error.

Log:
Error completing request
Arguments: ('', '', 'None', 'None', 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 0, 0, 0, False, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'Refresh models', False, False, False, False, '', '', 1, '', 0, '', True, False, False) {}
Traceback (most recent call last):
  File "H:\stable-diffusion-webui\modules\call_queue.py", line 45, in f
    res = list(func(*args, **kwargs))
  File "H:\stable-diffusion-webui\modules\call_queue.py", line 28, in f
    res = func(*args, **kwargs)
  File "H:\stable-diffusion-webui\modules\txt2img.py", line 49, in txt2img
    processed = process_images(p)
  File "H:\stable-diffusion-webui\modules\processing.py", line 470, in process_images
    res = process_images_inner(p)
  File "H:\stable-diffusion-webui\modules\processing.py", line 575, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "H:\stable-diffusion-webui\modules\processing.py", line 707, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 527, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 439, in launch_sampling
    return func()
  File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 527, in 
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 337, in forward
    x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 110, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 136, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1329, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 768, in forward
    emb = self.time_embed(t_emb)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\container.py", line 204, in forward
    input = module(input)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
  
Kazaflow commented 1 year ago

As far as I know, the error happens because you try to use half precision, which I think only some AMD GPUs support for some reason. EDIT: Using --no-half-vae is not necessary, at least in my testing.

reid3333 commented 1 year ago

When --no-half is specified, --no-half-vae is automatically enabled as well. https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/sd_models.py#L274

To avoid the above RuntimeError, it is necessary to specify --precision full, because DirectML does not support torch.autocast (automatic mixed precision) at this time. https://github.com/microsoft/DirectML/issues/192
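
To illustrate where the dtype error comes from (a minimal sketch, not webui code): with half-precision weights loaded and no working autocast, a float32 activation meets a float16 weight inside F.linear.

import torch
import torch.nn.functional as F

x = torch.randn(1, 8)                              # float32 activations
weight = torch.randn(8, 8, dtype=torch.float16)    # float16 weights, as with a half-precision model

try:
    F.linear(x, weight)                            # mixed dtypes with no autocast in effect
except RuntimeError as e:
    print(e)  # dtype-mismatch error; on GPU linear layers it reads "mat1 and mat2 must have the same dtype"

F.linear(x, weight.float())                        # --no-half / --precision full: everything stays in float32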

lshqqytiger commented 1 year ago

I rewrote the optimization code for AMD GPUs and found the optimal settings for the RX 5700 XT:

--no-half --precision full --opt-sub-quad-attention: 23 sec, 1.18 s/it (Euler, 512x768, 20 steps)

Others:

--no-half --precision full: Out of Memory
--no-half --precision full --opt-sub-quad-attention --medvram: 26 sec, 1.31 s/it

  1. --opt-sub-quad-attention was more effective than --opt-split-attention.
  2. When I tried to avoid the RuntimeError: mat1 and mat2 must have the same dtype error in the wrong way, it was actually slower. Some reports I received from RX 6000 series users showed performance improvements, but I couldn't confirm them because I don't have an RX 6000 series card.
majorsauce commented 1 year ago

@lshqqytiger did you also try training? With your fork, txt2img and img2img work great on my R9 390, but I receive RuntimeErrors in the autograd backward function, and the error does not include a description, so I only have the stack trace to work with. I also tried running without --no-half, but then it breaks at some other point.

lshqqytiger commented 1 year ago

I tried hypernetwork training with some images, but I couldn't reproduce the same error because I hit an out-of-memory error first.

majorsauce commented 1 year ago

@lshqqytiger thank you for the response. If you want to look into it, I could raise an issue on your fork with the detailed info and stack trace, but for that you'd need to enable issues on it.

lshqqytiger commented 1 year ago

I have enabled it now. A traceback will help me avoid such errors.

SheepChef commented 1 year ago

Thank you so much for looking into this issue! If DirectML support can be realized, as I mentioned, speed will improve significantly. It's a pity that I can't collaborate with you due to my academic workload and my limited AI development skills. However, I will keep following this issue, so keep up the good work!