AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Bug]: Win10, multiple GPU, cannot do parallel generation #9091

Open coollofty opened 1 year ago

coollofty commented 1 year ago

Is there an existing issue for this?

What happened?

I have 4 A4000 16 GB GPUs and use the openjourney model (about 2 GB).

I tried setting both CUDA_VISIBLE_DEVICES and --device-id (either in COMMANDLINE_ARGS or appended to the %PYTHON% launch.py line), and I also tried using CUDA_VISIBLE_DEVICES alone.

There is no problem during startup. When webui.bat prints "Running on local URL: http://0.0.0.0:7861", Task Manager shows that memory has been allocated on all four GPUs. Since the same model is loaded, the memory occupied on each of the four GPUs is the same: 3.4 GB.

Then I opened 4 tabs in my browser and visited 127.0.0.1:786[0-3] respectively; all 4 pages displayed normally. I typed a prompt and pressed the Generate button on all four pages as quickly as possible.
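For reference, that last step can also be scripted instead of clicking four browser tabs. This is only a minimal sketch, and it assumes each instance is started with --api so the built-in /sdapi/v1/txt2img endpoint is available on ports 7860-7863:

    import threading
    import requests  # third-party: pip install requests

    PORTS = [7860, 7861, 7862, 7863]

    def generate(port: int) -> None:
        # One txt2img request per instance; 24 steps matches the run in the log below.
        r = requests.post(
            f"http://127.0.0.1:{port}/sdapi/v1/txt2img",
            json={"prompt": "an image", "steps": 24},
            timeout=600,
        )
        print(port, r.status_code)

    # Fire all four requests at (almost) the same time.
    threads = [threading.Thread(target=generate, args=(p,)) for p in PORTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()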

Steps to reproduce the problem

Open 4 command windows and run the following in each:

Window 1:

    set CUDA_VISIBLE_DEVICES=0
    set COMMANDLINE_ARGS=--device-id 0
    webui.bat

Window 2:

    set CUDA_VISIBLE_DEVICES=1
    set COMMANDLINE_ARGS=--device-id 1
    webui.bat

Windows 3 and 4: the same pattern with device ids 2 and 3.
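One note on combining the two settings above: CUDA_VISIBLE_DEVICES renumbers whatever it leaves visible, so inside window 2 the single visible card becomes index 0 and a non-zero --device-id no longer matches any device. A minimal check of what an instance actually sees (my own sketch, not webui code):

    import os

    # Must be set before torch initializes CUDA, i.e. before "import torch".
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # window 2's setting from above

    import torch

    # The one visible card is renumbered to index 0 inside this process.
    print(torch.cuda.device_count())        # 1
    print(torch.cuda.get_device_name(0))    # the A4000 that is physical GPU 1
    # torch.zeros(1, device="cuda:1")       # would fail: invalid device ordinal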

What should have happened?

Only one instance operates normally; the other three report the same error, and which three command windows show the error differs from run to run.

Commit where the problem happens

master

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

webui.bat

List of extensions

DreamBooth, OpenPose, ControlNet

Console logs

Error completing request
Arguments: ('task(hxfslbor3yzpr4z)', 'an image', '', [], 24, 15, False, False, 1, 4, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, [], 0, False, False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, 'none', 'None', 1, None, False, 'Scale to Fit (Inner Fit)', False, False, 64, 64, 64, 1, False, False, False, 'positive', 'comma', 0, False, False, '', 1, '', 0, '', 0, '', True, False, False, False, 0, '', 'None', 30, 4, 0, 0, False, 'None', '<br>', 'None', 30, 4, 0, 0, 4, 0.4, True, 32) {}
Traceback (most recent call last):
  File "D:\stable-diffusion\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "D:\stable-diffusion\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "D:\stable-diffusion\modules\txt2img.py", line 56, in txt2img
    processed = process_images(p)
  File "D:\stable-diffusion\modules\processing.py", line 486, in process_images
    res = process_images_inner(p)
  File "D:\stable-diffusion\modules\processing.py", line 632, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "D:\stable-diffusion\modules\processing.py", line 832, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "D:\stable-diffusion\modules\sd_samplers_kdiffusion.py", line 349, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\stable-diffusion\modules\sd_samplers_kdiffusion.py", line 225, in launch_sampling
    return func()
  File "D:\stable-diffusion\modules\sd_samplers_kdiffusion.py", line 349, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "D:\stable-diffusion\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\stable-diffusion\repositories\k-diffusion\k_diffusion\sampling.py", line 594, in sample_dpmpp_2m
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\modules\sd_samplers_kdiffusion.py", line 117, in forward
    x_out = self.inner_model(x_in, sigma_in, cond={"c_crossattn": [cond_in], "c_concat": [image_cond_in]})
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "D:\stable-diffusion\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "D:\stable-diffusion\modules\sd_hijack_utils.py", line 17, in <lambda>
    setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
  File "D:\stable-diffusion\modules\sd_hijack_utils.py", line 28, in __call__
    return self.__orig_func(*args, **kwargs)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\models\diffusion\ddpm.py", line 1329, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\extensions\sd-webui-controlnet\scripts\hook.py", line 190, in forward2
    return forward(*args, **kwargs)
  File "D:\stable-diffusion\extensions\sd-webui-controlnet\scripts\hook.py", line 160, in forward
    h = module(h, emb, context)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\openaimodel.py", line 84, in forward
    x = layer(x, context)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 324, in forward
    x = block(x, context=context[i])
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 259, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 114, in checkpoint
    return CheckpointFunction.apply(func, len(inputs), *args)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\diffusionmodules\util.py", line 129, in forward
    output_tensors = ctx.run_function(*ctx.input_tensors)
  File "D:\stable-diffusion\repositories\stable-diffusion-stability-ai\ldm\modules\attention.py", line 262, in _forward
    x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
  File "D:\stable-diffusion\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\modules\sd_hijack_optimizations.py", line 127, in split_cross_attention_forward
    s1 = einsum('b i d, b j d -> b i j', q[:, i:end], k)
  File "D:\stable-diffusion\venv\lib\site-packages\torch\functional.py", line 378, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.99 GiB total capacity; 2.21 GiB already allocated; 11.69 GiB free; 2.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additional information

I just tried again: the behavior with only 2 GPUs is the same as with only 3 GPUs. Only one of the instances ever operates normally.


midcoastal commented 1 year ago

AFAIK: not a bug. The web UI just doesn't do multi-GPU; at least, I haven't been able to get it to use multiple GPUs, even with Accelerate.

Can anyone confirm?

coollofty commented 1 year ago

Not a bug? Maybe...

But why is the memory not enough? Each card has 16 GB, and at least 10 GB was still unused when the error occurred.

I think the only reasonable explanation is that some code does not honor the device-id setting and still uses the default card, so that card runs out of memory.
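If that is what is happening, the mechanism is easy to show with plain PyTorch: anything allocated with a bare "cuda" device (or moved with .cuda() and no index) goes to the current device, which defaults to GPU 0 unless the process calls torch.cuda.set_device. A minimal sketch of that mechanism (not webui code; cuda:1 is just an example value):

    import torch

    configured = torch.device("cuda:1")      # what --device-id 1 is meant to select

    a = torch.zeros(1, device=configured)    # lands on GPU 1 as intended
    b = torch.zeros(1, device="cuda")        # bare "cuda" -> current device -> GPU 0
    print(a.device, b.device)                # cuda:1 cuda:0

    # Making the configured card the process-wide current device closes that gap:
    torch.cuda.set_device(configured)
    c = torch.zeros(1, device="cuda")
    print(c.device)                          # cuda:1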