SheepChef opened 1 year ago
According to my brief investigation, there are a few dependencies that aren't compatible with torch 1.8.0:
1. PyTorch Lightning (solved by downgrading)
2. kornia (solved by downgrading)
3. xformers (all releases require torch >= 1.12.0, so it must be replaced with another substitute)
However, downgrading is not a final solution, since the project very likely uses functions that the older dependency versions don't provide.
I think modifying get_optimal_device like this would get it working:
def get_optimal_device():
    if torch.cuda.is_available():
        return torch.device(get_cuda_device_string())
    # fall back to DirectML if the torch-directml package is importable
    try:
        import torch_directml
        return torch_directml.device()
    except Exception:
        pass
    if has_mps():
        return torch.device("mps")
    return cpu
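A minimal smoke test could also be added right after device selection to catch broken setups early; this is my own sketch, not code from the thread, and it assumes the get_optimal_device function above is in scope:

import torch

device = get_optimal_device()
try:
    x = torch.ones(2, 2, device=device)       # trivial op on the chosen device
    assert (x + x).sum().item() == 8.0
    print(f"device {device} looks usable")
except Exception as e:
    print(f"device {device} failed ({e}), falling back to CPU")
    device = torch.device("cpu")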
I get an error like https://github.com/microsoft/DirectML/issues/368 when trying to use it:
Error completing request
Arguments: ('task(bkmy03sy59eptz3)', '1boy', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, 0, False, 'x264', 'mci', 10, 0, False, True, True, True, 'intermediate', 'animation', False, False, False, False, '', 1, '', 0, '', 0, '', True, False, False, False) {}
Traceback (most recent call last):
File "D:\dev\stable-diffusion-webui\modules\call_queue.py", line 56, in f
res = list(func(*args, **kwargs))
File "D:\dev\stable-diffusion-webui\modules\call_queue.py", line 37, in f
res = func(*args, **kwargs)
File "D:\dev\stable-diffusion-webui\modules\txt2img.py", line 52, in txt2img
processed = process_images(p)
File "D:\dev\stable-diffusion-webui\modules\processing.py", line 485, in process_images
res = process_images_inner(p)
File "D:\dev\stable-diffusion-webui\modules\processing.py", line 627, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "D:\dev\stable-diffusion-webui\modules\processing.py", line 822, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 544, in sample
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling
return func()
File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 544, in <lambda>
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\dev\stable-diffusion-webui\modules\sd_samplers.py", line 337, in forward
x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
File "D:\dev\stable-diffusion-webui\venv-dml\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "D:\dev\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 72, in sigma_to_t
low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(max=self.log_sigmas.shape[0] - 2)
RuntimeError
It seems that using DirectML would need many workarounds to make it work in its current state.
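For reference, the failing pattern can probably be reproduced outside webui in a few lines. This is my reconstruction of the op chain that sigma_to_t runs (assuming torch-directml is installed), not a snippet from the linked issue:

import torch
import torch_directml

dml = torch_directml.device()
dists = torch.randn(1000, 4).to(dml)   # stand-in for the log-sigma distances
# this is the chain from k_diffusion/external.py line 72 that raises
# RuntimeError under torch-directml (see microsoft/DirectML#368):
low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(max=dists.shape[0] - 2)
print(low_idx)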
First of all, sorry for my bad English - I'm from Poland. I found a workaround that worked for me (Intel UHD 620).
In external.py, replace the sigma_to_t function with this code:
def sigma_to_t(self, sigma, quantize=None):
    quantize = self.quantize if quantize is None else quantize
    # distances are computed in plain sigma space; the stock version works in
    # log-sigma space and uses ge/cumsum/argmax, which breaks under DirectML
    dists = torch.abs(sigma - self.sigmas[:, None])
    if quantize:
        return torch.argmin(dists, dim=0).view(sigma.shape)
    # take the two nearest schedule entries instead of cumsum/argmax
    low_idx, high_idx = torch.sort(torch.topk(dists, dim=0, k=2, largest=False).indices, dim=0)[0]
    low, high = self.sigmas[low_idx], self.sigmas[high_idx]
    w = (low - sigma) / (low - high)
    w = w.clamp(0, 1)
    t = (1 - w) * low_idx + w * high_idx
    return t.view(sigma.shape)
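A side note on the design (my reading, not from the thread): the stock k-diffusion sigma_to_t interpolates in log-sigma space via the ge/cumsum/argmax chain that DirectML trips over, while this workaround uses topk/sort and interpolates in plain sigma space, so the t values come out close but not bit-identical. A quick CPU comparison sketch with a made-up schedule:

import torch

sigmas = torch.exp(torch.linspace(-3, 2, 50))  # made-up ascending schedule
log_sigmas = sigmas.log()
sigma = torch.tensor([0.5, 1.0, 4.0])

# stock approach: log-sigma space, ge/cumsum/argmax
dists = sigma.log() - log_sigmas[:, None]
low_idx = dists.ge(0).cumsum(dim=0).argmax(dim=0).clamp(max=log_sigmas.shape[0] - 2)
high_idx = low_idx + 1
low, high = log_sigmas[low_idx], log_sigmas[high_idx]
w = ((low - sigma.log()) / (low - high)).clamp(0, 1)
t_stock = (1 - w) * low_idx + w * high_idx

# workaround: sigma space, topk/sort (DirectML-friendly)
d = torch.abs(sigma - sigmas[:, None])
low_i, high_i = torch.sort(torch.topk(d, dim=0, k=2, largest=False).indices, dim=0)[0]
lo, hi = sigmas[low_i], sigmas[high_i]
w2 = ((lo - sigma) / (lo - hi)).clamp(0, 1)
t_work = (1 - w2) * low_i + w2 * high_i

print(t_stock)  # expect values close to t_work
print(t_work)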
And here is the relevant part of my devices.py (not strictly necessary):
import torch_directml

dml = torch_directml.device()

def get_optimal_device():
    if torch_directml.is_available():
        return torch.device(dml)
    if has_mps():
        return torch.device("mps")
    return cpu
I found that DirectML support is very broken right now: after image generation, memory is not freed, and I have to restart the webui (so this is only a temporary solution).
Let me know if it helped.
EDIT: This is my first post on GitHub. I'm attaching the files in case they're needed (change the extension from .txt to .py): external.txt, devices.txt
RysiekMC's changes work, kind of. I got it to run on my RX 5700 XT, but it runs out of VRAM very quickly and, as said, doesn't free up VRAM; still, its speed is worth it over running on CPU only. The VRAM issue could be related to this error I get on initialization: "Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled"
This error appears because torch-directml uses a PyTorch implementation compiled for CPU. Another problem is that after generating a single image I have to restart the webui server to free up graphics memory. I think this is because webui currently targets only CUDA devices and requires more modifications, not only to webui itself but also to the other repositories it uses.
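To illustrate (my paraphrase, not webui's exact code): the memory monitor probes torch.cuda at startup, and on a CPU-only torch build that probe throws, producing the warning quoted above:

import torch

try:
    free, total = torch.cuda.mem_get_info()  # raises on a CPU-only build
except Exception as e:
    print(f"Warning: caught exception '{e}', memory monitor disabled")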
I noticed something interesting: passing --no-half --no-half-vae --medvram takes away the need to restart the UI for me, though it still uses all of my 8 GB of VRAM plus a few GB of RAM. But some samplers don't work because of "RuntimeError: Device type PRIVATEUSEONE is not supported for torch.Generator() api.", which again is a DirectML problem.
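A possible workaround sketch (my suggestion, untested on DirectML): since torch.Generator can't target the PRIVATEUSEONE backend, seed and sample the noise on the CPU, then move the tensor over:

import torch
import torch_directml

dml = torch_directml.device()

gen = torch.Generator("cpu").manual_seed(1234)    # CPU generators always work
noise = torch.randn(1, 4, 64, 64, generator=gen)  # reproducible noise on CPU
noise = noise.to(dml)                             # then move it to the DML device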
It works, but not entirely, on my RX 6700 XT. --no-half --no-half-vae --medvram works, but VRAM runs out quickly. Using --lowvram is fine without restarting, but the generation time is much longer. Not using these flags results in this error:
Error completing request
Arguments: ('', '', 'None', 'None', 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 0, 0, 0, False, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'LoRA', 'None', 1, 'Refresh models', False, False, False, False, '', '', 1, '', 0, '', True, False, False) {}
Traceback (most recent call last):
File "H:\stable-diffusion-webui\modules\call_queue.py", line 45, in f
res = list(func(*args, **kwargs))
File "H:\stable-diffusion-webui\modules\call_queue.py", line 28, in f
res = func(*args, **kwargs)
File "H:\stable-diffusion-webui\modules\txt2img.py", line 49, in txt2img
processed = process_images(p)
File "H:\stable-diffusion-webui\modules\processing.py", line 470, in process_images
res = process_images_inner(p)
File "H:\stable-diffusion-webui\modules\processing.py", line 575, in process_images_inner
samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
File "H:\stable-diffusion-webui\modules\processing.py", line 707, in sample
samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 527, in sample
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 439, in launch_sampling
return func()
File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 527, in <lambda>
samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "H:\stable-diffusion-webui\modules\sd_samplers.py", line 337, in forward
x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 110, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "H:\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 136, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1329, in forward
out = self.diffusion_model(x, t, context=cc)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "H:\stable-diffusion-webui\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 768, in forward
emb = self.time_embed(t_emb)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\container.py", line 204, in forward
input = module(input)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Users\GoForceX\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype
As far as I know, the error happens because you try to use half precision, which I think only some AMD GPUs can do for some reason. EDIT: Using --no-half-vae is not necessary, at least in my testing.
When --no-half is specified, --no-half-vae is automatically enabled as well.
https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/master/modules/sd_models.py#L274
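Paraphrasing that linked logic as a sketch (the function name is mine, not webui's; the gist is that half() only runs when --no-half is absent):

def apply_half_precision(model, cmd_opts):
    # half() only runs without --no-half, so with --no-half the VAE also
    # stays fp32, which is why --no-half implies --no-half-vae
    if not cmd_opts.no_half:
        vae = model.first_stage_model
        if cmd_opts.no_half_vae:
            model.first_stage_model = None  # shield the VAE from half()
        model.half()
        model.first_stage_model = vae       # reattach the still-fp32 VAE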
To avoid the above RuntimeError, it is necessary to specify --precision full, because DirectML does not support torch.autocast (automatic mixed precision) at this time.
https://github.com/microsoft/DirectML/issues/192
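A sketch of the assumed structure (not webui's literal code): with --precision autocast the forward pass is wrapped in torch.autocast, which the DirectML backend cannot handle yet; --precision full skips the wrapper entirely:

import contextlib
import torch

def precision_scope(precision: str):
    # "autocast" wraps execution in mixed precision; DirectML lacks support
    # for torch.autocast (microsoft/DirectML#192), so "full" must be used
    if precision == "autocast":
        return torch.autocast("cuda")
    return contextlib.nullcontext()  # --precision full: plain fp32

with precision_scope("full"):
    pass  # the model forward would run here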
I rewrote the optimization code for AMD GPUs and found optimal settings for the RX 5700 XT:

--no-half --precision full --opt-sub-quad-attention
23sec, 1.18s/it (Euler, 512x768, 20 steps)
Others:

--no-half --precision full: Out of Memory
--no-half --precision full --opt-sub-quad-attention --medvram: 26sec, 1.31s/it
--opt-split-attention: Out of Memory
--opt-sub-quad-attention: Out of Memory
--opt-sub-quad-attention --medvram: Slower processing and weird result.
--precision full --opt-sub-quad-attention: Out of Memory
--precision full --opt-sub-quad-attention --medvram: Weird result.

EDIT (after retesting):

--opt-split-attention: Out of Memory
--opt-sub-quad-attention: Out of Memory
--opt-sub-quad-attention --medvram: Slower processing and normal result.
--precision full --opt-sub-quad-attention: Out of Memory
--precision full --opt-sub-quad-attention --medvram: 24sec, 1.24s/it and normal result.

--opt-sub-quad-attention was more effective than --opt-split-attention. When I avoided the "RuntimeError: mat1 and mat2 must have the same dtype" error in the wrong way, it was rather slower. Some reports I received from RX 6000 series users showed performance improvements, but I couldn't confirm them because I don't have an RX 6000 series card.

@lshqqytiger did you also try training? With your fork, txt2img and img2img work great on my R9 390, but I receive RuntimeErrors in the autograd backward function, and the error has no description, so I only have the stack trace to work with. I also tried running without --no-half, but then it breaks at some other point.
I tried hypernetwork training with some images, but I couldn't reproduce the same error because I ran into an out-of-memory error first.
@lshqqytiger thank you for the response. If you want to look into it, I could raise an issue on your fork with the detailed info and stack trace, but for that you'd need to enable issues on it.
I have enabled it now. A traceback will help me avoid such errors.
Thank you so much for looking into this issue! If DirectML support can be realized, as I mentioned, the speed will increase significantly. It's a pity that I can't collaborate with you due to my academic workload and my limited AI development skills. However, I am always looking forward to progress, so keep up the good work!
Is there an existing issue for this?
What would your feature do?
Microsoft has published a new torch package called pytorch-directml, which adds a new device called "dml" to PyTorch and can run on any GPU (including AMD) on Windows 10 with DirectX 12.
I've noticed that DirectML has recently added support for Python 3.10 and the newest version of PyTorch; however, there are still many problems ahead.
I am wondering if this feature can be added. If it is, efficiency will be much improved on Windows devices that only have AMD GPUs, and on Linux devices that the AMD ROCm driver doesn't support.
For instance, I compared the speed of CPU-only, CUDA, and DirectML for 512x512 image generation with 20 steps:

CPU-only: around 6~9 minutes.
CUDA: within 10 seconds.
DirectML: within 10~30 seconds.

Even in the worst case (360 seconds vs. 30 seconds), DirectML is about 12 times faster than CPU-only, and up to roughly 54 times faster in the best case (540 seconds vs. 10 seconds).
For a pytorch-directml reference, check pytorch-with-directml.
Proposed workflow
The feature requires a fundamental change to the project. If all the dependency problems are solved, what remains is changing the device from "cuda" to "dml", for example:
parser.add_argument('--device', type=str, default='dml', help='The device to use for training.')
So the steps are:

1. Solve the dependency compatibility problems (PyTorch Lightning, kornia, xformers).
2. Replace the "cuda" device with "dml" throughout the code.
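For step 2, a rough sketch of how the flag could drive device selection (the names are mine, and I'm assuming the newer torch-directml plugin API rather than the old pytorch-directml fork):

import torch

def select_device(name: str) -> torch.device:
    if name == "dml":
        import torch_directml           # assumption: plugin must be installed
        return torch_directml.device()  # default DirectX 12 adapter
    return torch.device(name)           # "cuda", "cpu", ...

device = select_device("dml")
x = torch.randn(2, 2, device=device)    # tensors then live on the DML adapter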
Additional information
If this feature is added, the GPU hardware requirements for the project will be much lower. Please consider it!