AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

MPS backend out of memory #9133

Open fangyinzhe opened 1 year ago

fangyinzhe commented 1 year ago

Is there an existing issue for this?

What happened?

macOS. I can open http://127.0.0.1:7860/ without problems, but generating an image produces this error: RuntimeError: MPS backend out of memory (MPS allocated: 5.05 GB, other allocations: 2.29 GB, max allowed: 6.77 GB). Tried to allocate 1024.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)

Steps to reproduce the problem

Install MPS

What should have happened?

RuntimeError: MPS backend out of memory (MPS allocated: 5.05 GB, other allocations: 2.29 GB, max allowed: 6.77 GB). Tried to allocate 1024.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)

Commit where the problem happens

python: 3.10.10  •  torch: 1.12.1  •  xformers: N/A  •  gradio: 3.16.2  •  commit: 0cc0ee1b  •  checkpoint: bf864f41d5

What platforms do you use to access the UI ?

MacOS

What browsers do you use to access the UI ?

Apple Safari

Command Line Arguments

NO

List of extensions

NO

Console logs

RuntimeError: MPS backend out of memory (MPS allocated: 5.05 GB, other allocations: 2.29 GB, max allowed: 6.77 GB). Tried to allocate 1024.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure)

Additional information

No response

elisezhu123 commented 1 year ago

The 8 GB version of the Mac doesn't have enough memory for MPS acceleration, and PyTorch 2.0 / MPS only works on macOS 13+.

fangyinzhe commented 1 year ago

Intel Core i7, 16 GB RAM

pudepiedj commented 1 year ago

I have also experienced this runtime error while running the open-source version of Whisper on a 2019 Macbook built on an Intel i9 8-core CPU with 16GB RAM and an AMD Radeon Pro 5500M.

I had previously been running a decoder simulation that runs perfectly on Google Colab, which is when the error we've both experienced first appeared, but reducing batch sizes massively made no difference to the error, which then started appearing in Whisper runs on audio files of negligible size. So I concluded that it wasn't really a memory error at all, whatever the error message may say.

However, I extracted the Whisper code to another Jupyter Notebook and it ran perfectly on the GPU using the latest releases from Apple and PyTorch on macOS Ventura 13.3, with 13.0, as @elisezhu123 says, being the minimum requirement. So the problem has "gone away" rather than been solved, but I'd suggest rerunning your code in another clean notebook as a first step. The suggested "fix" with the environment variable is dangerous, and probably unnecessary, but if you do use it I'd try setting it to a value other than 0.0; I think the default is 0.7, i.e. the GPU can use 70% of memory, so maybe raise it a bit. But I really don't think memory is the problem; there's a "glitch" somewhere that changing notebooks fixes. Obviously very happy to be corrected on this if I am mistaken.

fangyinzhe commented 1 year ago

So I can only switch to another computer, right?

pudepiedj commented 1 year ago

No - misunderstanding of "notebook". I meant that changing the code to another Jupyter (Anaconda3) notebook (not another physical Mac notebook) sorted the problem out for me, but since writing that it has come back again, so I am not sure that what I did solved it at all. There are some suggestions elsewhere that there may be an issue with MacOS Ventura 13.3 but I am not in a position to explore that.

elisezhu123 commented 1 year ago

I have also experienced this runtime error while running the open-source version of Whisper on a 2019 Macbook built on an Intel i9 8-core CPU with 16GB RAM and an AMD Radeon Pro 5500M.

I had previously been running a decoder simulation that runs perfectly on Google Colab, which is when the error we've both experienced first appeared, but reducing batch sizes massively made no difference to the error, which then started appearing in Whisper runs on audio files of negligible size. So I concluded that it wasn't really a memory error at all, whatever the error message may say.

However, I extracted the Whisper code to another Jupyter Notebook and it ran perfectly on the GPU using the latest releases from Apple and PyTorch on Ventura macOS 13.3, with 13.0, as @elisezhu123 says , the minimum requirement. So the problem has "gone away" rather than being solved, but I'd suggest just rerunning your code in another clean notebook as a first step. The suggested "fix" with the environment variable is dangerous, and probably unnecessary, but if you do use it I'd try setting it to another value than 0.0; I think the default is 0.7, i.e. the GPU can use 70% memory, so maybe raise it a bit, but I really don't think memory is the problem; there's a "glitch" somewhere that changing notebooks fixes. Obviously very happy to be corrected on this if I am mistaken.

It is just a bug in 13.3… 13.2 works.

GrinZero commented 1 year ago

Excuse me, could you please tell me how to activate the MPS mode. I don't quite understand this.

vanilladucky commented 1 year ago

Excuse me, could you please tell me how to activate the MPS mode. I don't quite understand this.

On a Mac, CUDA doesn't work because there is no dedicated NVIDIA GPU. So we have to download a specific version of PyTorch to utilize the Metal Performance Shaders (MPS) backend. This webpage from Apple explains it best.

After installing that version of PyTorch, you should be able to simply use the MPS backend. Personally, I use the line device = torch.device('mps'); you can check by printing device, and if it gives you back 'mps', you are good to go.

Hope this helps.
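
For anyone who wants to see the above in one place, here is a minimal sketch of that device selection (plain PyTorch, not specific to this repo; it assumes a PyTorch build with MPS support):

    import torch

    # Prefer the Metal Performance Shaders (MPS) backend when it is available,
    # otherwise fall back to the CPU.
    device = torch.device('mps') if torch.backends.mps.is_available() else torch.device('cpu')
    print(device)  # prints "mps" on a supported Mac with an MPS-enabled PyTorch build

    # Tensors (and models) are then moved onto the chosen device explicitly.
    x = torch.ones(3, device=device)
    print(x.device)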

stephanebdc commented 1 year ago

Same problem here, any solution? Running `from transformers import Blip2Processor, Blip2ForConditionalGeneration` and `import torch` for Salesforce/blip2-opt-2.7b on a 2019 MacBook with 16 GB RAM, an i9, and the Radeon.

honzajavorek commented 1 year ago

I'm experiencing this with the latest commit of automatic and PyTorch v2 on my M1 8 GB running on macOS Ventura 13.3.1 (a).

Click to see the stack trace

```
Traceback (most recent call last):
File "/Users/honza/Projects/stable-diffusion-webui/modules/call_queue.py", line 57, in f res = list(func(*args, **kwargs))
File "/Users/honza/Projects/stable-diffusion-webui/modules/call_queue.py", line 37, in f res = func(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/modules/img2img.py", line 181, in img2img processed = process_images(p)
File "/Users/honza/Projects/stable-diffusion-webui/modules/processing.py", line 515, in process_images res = process_images_inner(p)
File "/Users/honza/Projects/stable-diffusion-webui/extensions/sd-webui-controlnet/scripts/batch_hijack.py", line 42, in processing_process_images_hijack return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/modules/processing.py", line 604, in process_images_inner p.init(p.all_prompts, p.all_seeds, p.all_subseeds)
File "/Users/honza/Projects/stable-diffusion-webui/modules/processing.py", line 1084, in init self.init_latent = self.sd_model.get_first_stage_encoding(self.sd_model.encode_first_stage(image))
File "/Users/honza/Projects/stable-diffusion-webui/modules/sd_hijack_utils.py", line 17, in setattr(resolved_obj, func_path[-1], lambda *args, **kwargs: self(*args, **kwargs))
File "/Users/honza/Projects/stable-diffusion-webui/modules/sd_hijack_utils.py", line 26, in __call__ return self.__sub_func(self.__orig_func, *args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/modules/sd_hijack_unet.py", line 76, in first_stage_sub = lambda orig_func, self, x, **kwargs: orig_func(self, x.to(devices.dtype_vae), **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py", line 830, in encode_first_stage return self.first_stage_model.encode(x)
File "/Users/honza/Projects/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/autoencoder.py", line 83, in encode h = self.encoder(x)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/model.py", line 526, in forward h = self.down[i_level].block[i_block](hs[-1], temb)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/modules/diffusionmodules/model.py", line 131, in forward h = self.norm1(h)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward return F.group_norm(
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 2956, in native_group_norm out, mean, rstd = _normalize(input_reshaped, reduction_dims, eps)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 2914, in _normalize biased_var, mean = torch.var_mean(
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 2419, in var_mean m = mean(a, dim, keepdim)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 2373, in mean result = true_divide(result, nelem)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 220, in _fn result = fn(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 130, in _fn result = fn(**bound.arguments)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 926, in _ref return prim(a, b)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_refs/__init__.py", line 1619, in true_divide return prims.div(a, b)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_ops.py", line 287, in __call__ return self._op(*args, **kwargs or {})
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_prims/__init__.py", line 278, in _prim_impl meta(*args, **kwargs)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_prims/__init__.py", line 400, in _elementwise_meta return TensorMeta(device=device, shape=shape, strides=strides, dtype=dtype)
File "/Users/honza/Projects/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_prims/__init__.py", line 256, in TensorMeta return torch.empty_strided(shape, strides, dtype=dtype, device=device)
RuntimeError: MPS backend out of memory (MPS allocated: 4.13 GB, other allocations: 5.24 GB, max allowed: 9.07 GB). Tried to allocate 512 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
```

While normal image generation works, this often occurs when I'm trying to use ControlNet, though not always. I couldn't really figure out what the differentiator is. I have almost all other apps closed to leave as much RAM free as possible.

What are my options to avoid this? I've noticed @brkirch is posting to discussions about Apple performance and has a fork at https://github.com/brkirch/stable-diffusion-webui/ which is 14 commits ahead. Is this something that could speed up my poor performance or solve the "MPS backend out of memory" problem? Will it ever be merged upstream? 🤔

akamitoro commented 1 year ago

I also keep having this issue if I scale the images on my M1 8 GB Mac Mini.

akamitoro commented 1 year ago

Any way to work around the issue? Would the recommended solution from the error help? And how do I do it?

Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit

honzajavorek commented 1 year ago

This seems to help, at least in my case:

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --precision full --no-half

rovo79 commented 1 year ago

Excuse me, could you please tell me how to activate the MPS mode. I don't quite understand this.

On Mac, cuda doesn't work as it doesn't have a dedicated nvidia GPU. So we would have to download a specific version of PyTorch to utilize the Metal Performance Shaders (MPS) backend. This webpage on Apple explains it best.

After installing the specific version of PyTorch, you should be able to simply call the MPS backend. Personally, I utilize this line of code device = torch.device('mps') and you can check by calling on device and if it gives you back 'mps', you are good to go.

Hope this helps.

Where do you put that line of code? device = torch.device('mps')

pudepiedj commented 1 year ago

I recommend reading the very good documentation on the PyTorch website, which has examples showing how to use the MPS device and how to load data onto it: https://pytorch.org/docs/stable/notes/mps.html


honzajavorek commented 1 year ago

I think the latest automatic release with pytorch2 already does this for you?


pudepiedj commented 1 year ago

I am not sure what you mean. PyTorch 2 has MPS support through torch.mps, and the PyTorch nightly (now at 2.1.0.dev20230512) also has it, but unless I have missed something the mps device must still be deliberately invoked, because some hardware systems don't have it. Please let me know if I am mistaken!


vanilladucky commented 1 year ago

Excuse me, could you please tell me how to activate the MPS mode. I don't quite understand this.

On Mac, cuda doesn't work as it doesn't have a dedicated nvidia GPU. So we would have to download a specific version of PyTorch to utilize the Metal Performance Shaders (MPS) backend. This webpage on Apple explains it best. After installing the specific version of PyTorch, you should be able to simply call the MPS backend. Personally, I utilize this line of code device = torch.device('mps') and you can check by calling on device and if it gives you back 'mps', you are good to go. Hope this helps.

Where do you put that line of code? device = torch.device('mps')

So the line of code device = torch.device('mps') is merely a line to initiate the device as mps instead of the normal cpu. If we don't run this line, PyTorch would just place its data and parameters on the cpu. So this line has be run anywhere in the code. However, be it on Jupyter notebooks or Python code, I recommend you to make sure it runs at the very top or somewhere where you import all your necessary libraries.

Without this line ran first, when you move your model and data to device, .to(device = device), those data won't be placed in the mps.

If you are new to PyTorch and the usage of mps on mac, I encourage you to read loading data onto the mps here. It is important to know how to load data and model parameters onto devices if you wish to run large models quickly. Without them, it would probably take you hours and even days to run just one epoch.

Hope this helps!
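
As a rough illustration of that last point, here is a toy sketch (the model is a placeholder, not anything from stable-diffusion-webui): the model's parameters and the input batch must end up on the same device, otherwise PyTorch raises a device-mismatch error.

    import torch
    import torch.nn as nn

    device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

    # Move both the model's parameters and the input data to the same device.
    model = nn.Linear(16, 4).to(device)
    batch = torch.randn(8, 16).to(device)

    output = model(batch)
    print(output.device)  # mps (or cpu if MPS is unavailable)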

honzajavorek commented 1 year ago

What about this?

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/b08500cec8a791ef20082628b49b17df833f5dda/modules/devices.py#LL38C21

dlebouc commented 1 year ago

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --no-half (without --precision full) works perfectly for me. Since I added PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 I haven't encountered the bug, and the 4 performance cores of my MacBook M1 are used much more than before.

BrjGit commented 1 year ago

Total noob here. Trying to utilize stable diffusion with deforum extension. Where exactly do I input the PYTORCH_MPS_HIGH_WATERMARK code into?

dlebouc commented 1 year ago

Total noob here. Trying to utilize stable diffusion with deforum extension. Where exactly do I input the PYTORCH_MPS_HIGH_WATERMARK code into?

In terminal, type : cd ~/stable-diffusion-webui; PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --no-half

BrjGit commented 1 year ago

Total noob here. Trying to utilize stable diffusion with deforum extension. Where exactly do I input the PYTORCH_MPS_HIGH_WATERMARK code into?

In terminal, type : cd ~/stable-diffusion-webui; PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --no-half

Lifesaver. Thank you. It works now.

akamitoro commented 1 year ago

This seems to help, at least in my case:

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --precision full --no-half

Thank you very much sir, this works, but it is painfully slow: 2-3 hours to upscale an image 2x from 640x950. Is there any way to speed this up? What setting should I adjust in highres.fix?

pudepiedj commented 1 year ago

Have you tried all the Apple optimisation suggestions at https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Installation-on-Apple-Silicon? In the last paragraph there are specific suggestions about timing and how to improve it.


pudepiedj commented 1 year ago

I see what you mean. I was misunderstanding you to be suggesting that PyTorch2 automatically selects the mps device, which I don't think it does. Sorry for the confusion!


honzajavorek commented 1 year ago

@pudepiedj no problem!

honzajavorek commented 1 year ago

Regarding the settings, you can put the environment variables into your webui-user.sh as well. This is what mine looks like right now:

#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################

# Install directory without trailing slash
#install_dir="/home/$(whoami)"

# Name of the subdirectory
#clone_dir="stable-diffusion-webui"

# PyTorch settings
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --no-half --opt-sub-quad-attention --use-cpu interrogate"

# python3 executable
#python_cmd="python3"

... file continues unchanged ...

Then all you need to run your web UI is plain ./webui.sh, everything gets applied automatically.

pudepiedj commented 1 year ago

Does this in fact implement and use the MPS device? I've been investigating over the weekend using the Activity Monitor "GPU History" display and I don't think my GPU is being used at all; stable-diffusion is just running on the CPU. This of course may explain why I am not getting the "MPS Backend Out of Memory" error, too! :)


honzajavorek commented 1 year ago

Not sure exactly. I'm just cargo culting the command line options based on whatever I read around the discussions and issues.

pudepiedj commented 1 year ago

I put a crude print statement into the function you found and pointed out, where has_mps() is used, supposedly to load it and use it, but it has never appeared in any output, so I am far from sure that the devices.py script is ever executed. And I read somewhere that has_mps() is only implemented in a late version of PyTorch, maybe even only in the nightly builds. Apple's support for its own hardware is dreadful, and AMD seem to have lost interest.


branksypop commented 1 year ago

Here's my current understanding of this issue:
PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 will make webui.sh work, but using only the CPU (6+ minutes for txt2img).

On the other proposed alternative: I did install the Mac version of PyTorch using Apple's link https://developer.apple.com/metal/pytorch/ and I did read the explanation on loading MPS at https://pytorch.org/docs/stable/notes/mps.html, but either I don't understand it, or it is too general for me to work out how to implement it in automatic1111 or to find the correct places where I'd need to load MPS for the tensors in that specific way.

So I'm currently inspecting modules/devices.py and modules/mac_specific.py. Debugging mac_specific.py and its has_mps method through a custom debug script does output true to having MPS ... so the MPS device is correctly detected. However, there seem to be some fixes to PyTorch in that file and possibly the errors may lie there... Also not sure if the line proposed by @vanilladucky is an override that could be placed somewhere in devices.py => (device = torch.device('mps')).

My config: i9, 16 GB / Ventura 13.1 / AMD Radeon Pro 5500M 4 GB
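
A check along those lines can be as small as the sketch below (a generic snippet, not the actual devices.py / mac_specific.py code; it only confirms what a has_mps-style check is trying to establish):

    import torch

    # The MPS backend must be both compiled into this PyTorch build and usable
    # on the current macOS / hardware combination.
    print("MPS built:    ", torch.backends.mps.is_built())
    print("MPS available:", torch.backends.mps.is_available())

    if torch.backends.mps.is_available():
        # A trivial allocation proves the device can actually be used.
        print(torch.ones(1, device="mps"))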

branksypop commented 1 year ago

I got this to work (AMD, Intel) just by using arguments; in my specific case: --upcast-sampling --no-half-vae --no-half --opt-split-attention-v1 --lowvram --use-cpu interrogate --skip-torch-cuda-test

Using Stable Diffusion 2.1 at 768x768, txt2img takes 3 minutes and uses the GPU. It is still long, but much shorter than just using the CPU.

Now the question is whether the Mac PyTorch installation was really needed... but I'll check that some other time.

pudepiedj commented 1 year ago

Interesting. Are you sure it is using the Radeon GPU? I had a similar result using these arguments but when I looked at the GPU history in Activity Monitor I saw that it was in fact running almost entirely on the Intel 630, not the Radeon 5500M. (I have exactly the same hardware as you.)


branksypop commented 1 year ago

Yes, 100% sure. I had the monitors open all the time, GPU and CPU, mostly to inspect whether the GPU was used at all depending on the arguments I used. I never once saw the Intel 630 firing up, but even if it did, I'd still consider it a success, as it's still some kind of GPU acceleration vs. CPU cores only. For me, all that PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 does is force CPU-only usage, which takes 6+ to 20 minutes per photo.

pudepiedj commented 1 year ago

OK. Thank you for the clarification. Please be good enough to give the exact details of what you typed and where, because I am unable to reproduce this behaviour.


branksypop commented 1 year ago

Hello @pudepiedj, I copied the exact full arguments that are used and traced back: once you run ./webui.sh it prints out the actual arguments it was launched with, so you can verify that it really uses what you think you have set up.

There are several ways to set them up or override them, depending on your preferences. In my example, for testing purposes I have export COMMANDLINE_ARGS="--skip-torch-cuda-test" in the file webui-user.sh, and I add the other arguments on launch like this: ./webui.sh --upcast-sampling --no-half-vae --no-half --opt-split-attention-v1 --lowvram --use-cpu interrogate

pudepiedj commented 1 year ago

Cool, thanks! But I did all this, including incorporating the command-line parameters into the webui-user.sh shell-script, and it still doesn't use the GPU. Are you just relying on device.py and the "has-mps()" function to trigger GPU use? I can see no other way to do it, but it clearly doesn't work at least for me. I also don't understand - and perhaps therefore misunderstand - the --use-cpu parameter: doesn't that tell the system NOT to use the GPU?


GyperionUA commented 1 year ago

Just want to leave my feedback here. I run easy-diffusion on my Mac and it generates a 512x512 txt2img in about 2 minutes or so. When I tried to use stable-diffusion on the same Mac: 1) I had an error "'"upsample_..._last" not implemented for 'Half''; 2) when I ran with the --no-half parameter, I got the error "... Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).", which brought me here; 3) when I ran with "PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --precision full --no-half" it worked, but txt2img generation takes about 8-10 minutes. That brings me to the understanding that the current parameters don't let me use the Mac's performance at the level easy-diffusion allows. So there should be a way to improve it, but I have no idea how to do that. I also tried to find the run parameters easy-diffusion uses, but I don't know where to look for them...

wwwAngHua commented 1 year ago

This seems to help, at least in my case:

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --precision full --no-half

thank you!!!

hal0thane commented 1 year ago

I started encountering this problem shortly after upgrading to Ventura 13.3. I was running fine before then, and had better performance than with PyTorch 1.x, since I was no longer getting the "Warning: The operator 'aten::index.Tensor' is not currently supported on the MPS backend and will fall back to run on the CPU." message.

Has anyone tried updating to Ventura 13.4? If not, that's an idea. But if someone has already done that and found that it didn't help, another option is to downgrade to a previous version. Reinstalling 13.2.1 might be a better option than 13.4. As far as I can tell, this problem did not exist on 13.2 or 13.2.1.

Have any of you experienced it running something besides 13.3 or 13.3.1?

shamshhoda commented 1 year ago

Hi, I guess you're also using Stable Diffusion with ControlNet here. One easy way is to reduce your batch size. For example, if you had the batch size at 8, reduce it to 4 or 5, or even just 1. It should work and would be faster. Try this without setting the PYTORCH_MPS_HIGH_WATERMARK_RATIO variable.

This seems to help, at least in my case:

PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 ./webui.sh --precision full --no-half

tyvm sir, this works but it is painfully long 2,3 hours to upscale 2x an image from 640x950 res. Is there anyway to speed this up? what setting to adjust highres.fix?

Chi-XU-Sean commented 1 year ago

Hello @pudepiedj , I copied the exact full arguments that are used and traced back once you run ./webui.sh - it prints out what are the actual arguments it was launched with so that you can verify that it really uses what you think you have set up.

there are several ways to set them up or overide them depending on your preferences. In my example for testing purposes I have export COMMANDLINE_ARGS="--skip-torch-cuda-test"in the file webuis-user.sh and I add the other arguments on launch like this ./webui.sh --upcast-sampling --no-half-vae --no-half --opt-split-attention-v1 --lowvram --use-cpu interrogate

Thank you! It works on my M2 Max device. It uses GPU instead of CPU.

ohmygenie commented 1 year ago

Hello, I have been trying to build a simple Python GUI using tkinter for Stable Diffusion. I keep hitting the same issue since I'm on an M1 Mac. Here's my code; I tried adding --skip-torch-cuda-test directly in my .py code, but it's not working. Please help.

Error: RuntimeError: MPS backend out of memory (MPS allocated: 16.46 GB, other allocations: 1.98 GB, max allowed: 18.13 GB). Tried to allocate 1024.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

    import os
    from diffusers import StableDiffusionPipeline

    # Set environment variables
    os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.9"
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    # Set command line arguments
    os.environ["COMMANDLINE_ARGS"] = "--skip-torch-cuda-test --upcast-sampling --no-half-vae --no-half --opt-sub-quad-attention --use-cpu interrogate"

    SDV5_MODEL_PATH = "/Users/user/stable-diffusion-v1-5/"
    SAVE_PATH = os.path.join(os.environ['HOME'], "Desktop", "SDV5_OUTPUT")

    if not os.path.exists(SAVE_PATH):
        os.mkdir(SAVE_PATH)

    def uniquify(path):
        filename, extension = os.path.splitext(path)
        counter = 1

        while os.path.exists(path):
            path = filename + " (" + str(counter) + ")" + extension
            counter += 1

        return path

    prompt = "A dog rising in motorcycle"

    print(f"Characters in prompt: {len(prompt)}, limit: 200")

    pipe = StableDiffusionPipeline.from_pretrained(SDV5_MODEL_PATH)
    pipe = pipe.to("mps")

    output = pipe(prompt)

    # Use the images attribute to access the generated images
    image = output.images[0]  # Adjusted this line based on your findings

    # Save the image
    image_path = uniquify(os.path.join(SAVE_PATH, (prompt[:25] + "...") if len(prompt) > 25 else prompt) + ".png")
    print(image_path)

    image.save(image_path)

@pudepiedj @branksypop @honzajavorek

tmm1 commented 1 year ago

the default values can be seen in the source code:

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.mm#L25-L35

  static const char* high_watermark_ratio_str = getenv("PYTORCH_MPS_HIGH_WATERMARK_RATIO");
  const double high_watermark_ratio =
      high_watermark_ratio_str ? strtod(high_watermark_ratio_str, nullptr) : default_high_watermark_ratio;
  setHighWatermarkRatio(high_watermark_ratio);

  const double default_low_watermark_ratio =
      m_device.hasUnifiedMemory ? default_low_watermark_ratio_unified : default_low_watermark_ratio_discrete;
  static const char* low_watermark_ratio_str = getenv("PYTORCH_MPS_LOW_WATERMARK_RATIO");
  const double low_watermark_ratio =
      low_watermark_ratio_str ? strtod(low_watermark_ratio_str, nullptr) : default_low_watermark_ratio;
  setLowWatermarkRatio(low_watermark_ratio);

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.h#L299-L306

  // (see m_high_watermark_ratio for description)
  constexpr static double default_high_watermark_ratio = 1.7;
  // we set the allowed upper bound to twice the size of recommendedMaxWorkingSetSize.
  constexpr static double default_high_watermark_upper_bound = 2.0;
  // (see m_low_watermark_ratio for description)
  // on unified memory, we could allocate beyond the recommendedMaxWorkingSetSize
  constexpr static double default_low_watermark_ratio_unified  = 1.4;
  constexpr static double default_low_watermark_ratio_discrete = 1.0;

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.h#L326-L332

  // high watermark ratio is a hard limit for the total allowed allocations
  // 0. : disables high watermark limit (may cause system failure if system-wide OOM occurs)
  // 1. : recommended maximum allocation size (i.e., device.recommendedMaxWorkingSetSize)
  // >1.: allows limits beyond the device.recommendedMaxWorkingSetSize
  // e.g., value 0.95 means we allocate up to 95% of recommended maximum
  // allocation size; beyond that, the allocations would fail with OOM error.
  double m_high_watermark_ratio;
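
If you want to override those defaults from Python rather than from the shell, the variable has to be in the environment before the MPS allocator reads it, so the safe pattern is to set it before importing torch. A sketch (the 0.7 value is just the figure used earlier in this thread, not a recommendation):

    import os

    # Must be set before the first MPS allocation; before importing torch is safest.
    os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.7"

    import torch

    x = torch.ones(1, device="mps")  # the allocator now uses the 0.7 high watermark ratio
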
ohmygenie commented 1 year ago

@pudepiedj @branksypop @honzajavorek

the default values can be seen in the source code:

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.mm#L25-L35

  static const char* high_watermark_ratio_str = getenv("PYTORCH_MPS_HIGH_WATERMARK_RATIO");
  const double high_watermark_ratio =
      high_watermark_ratio_str ? strtod(high_watermark_ratio_str, nullptr) : default_high_watermark_ratio;
  setHighWatermarkRatio(high_watermark_ratio);

  const double default_low_watermark_ratio =
      m_device.hasUnifiedMemory ? default_low_watermark_ratio_unified : default_low_watermark_ratio_discrete;
  static const char* low_watermark_ratio_str = getenv("PYTORCH_MPS_LOW_WATERMARK_RATIO");
  const double low_watermark_ratio =
      low_watermark_ratio_str ? strtod(low_watermark_ratio_str, nullptr) : default_low_watermark_ratio;
  setLowWatermarkRatio(low_watermark_ratio);

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.h#L299-L306

  // (see m_high_watermark_ratio for description)
  constexpr static double default_high_watermark_ratio = 1.7;
  // we set the allowed upper bound to twice the size of recommendedMaxWorkingSetSize.
  constexpr static double default_high_watermark_upper_bound = 2.0;
  // (see m_low_watermark_ratio for description)
  // on unified memory, we could allocate beyond the recommendedMaxWorkingSetSize
  constexpr static double default_low_watermark_ratio_unified  = 1.4;
  constexpr static double default_low_watermark_ratio_discrete = 1.0;

https://github.com/pytorch/pytorch/blob/bfd995f0d6bf87262613b5e89d871832ca9e9938/aten/src/ATen/mps/MPSAllocator.h#L326-L332

  // high watermark ratio is a hard limit for the total allowed allocations
  // 0. : disables high watermark limit (may cause system failure if system-wide OOM occurs)
  // 1. : recommended maximum allocation size (i.e., device.recommendedMaxWorkingSetSize)
  // >1.: allows limits beyond the device.recommendedMaxWorkingSetSize
  // e.g., value 0.95 means we allocate up to 95% of recommended maximum
  // allocation size; beyond that, the allocations would fail with OOM error.
  double m_high_watermark_ratio;

Thanks. Apparently my torch installation on the M1 was having a problem; I've reinstalled it and it's now working. Now I get a new error:

NotImplementedError: The operator 'aten::index.Tensor' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Essentially, here's what's happening for an Apple silicon user: Option #1: GPU (not possible). Option #2: CPU (I tried it; it takes 30 minutes to generate one picture). Option #3: MPS, but I have the new error above. Option #4: try AUTOMATIC1111, which impressively generates a picture in only 20 seconds; however, it's not customisable, say if you want to build something like that as a project for a client.

So yeah, it's a painful situation for Apple silicon users who want to build an AI program using SD from scratch.
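
For option #3, the fallback that the NotImplementedError mentions can also be enabled from inside a standalone script, as long as it is set before PyTorch starts up. A sketch reusing the model path from the code above (not a guaranteed fix, and the ops that fall back run on the CPU, so they are slower):

    import os

    # Let ops MPS doesn't implement (e.g. aten::index.Tensor) fall back to the CPU.
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("/Users/user/stable-diffusion-v1-5/").to("mps")
    image = pipe("A dog riding a motorcycle").images[0]
    image.save("dog.png")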

BewhY08 commented 1 year ago

Replacing this code will allow you to map it, but the ControlNet functionality will not work properly

Regarding the settings, you can also add the environment variables to your webui-user.sh. This is what mine looks like right now:

#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################

# Install directory without trailing slash
#install_dir="/home/$(whoami)"

# Name of the subdirectory
#clone_dir="stable-diffusion-webui"

# PyTorch settings
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --no-half --opt-sub-quad-attention --use-cpu interrogate"

# python3 executable
#python_cmd="python3"

... file continues unchanged ...

Then all you need to run the web UI is plain ./webui.sh, and everything gets applied automatically.

Replacing this code will allow you to map it, but the ControlNet functionality will not work properly

ohmygenie commented 1 year ago

Replacing this code will allow you to map it, but the ControlNet functionality will not work properly

Regarding the settings, you can also add the environment variables to your webui-user.sh. This is what mine looks like right now:

#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################

# Install directory without trailing slash
#install_dir="/home/$(whoami)"

# Name of the subdirectory
#clone_dir="stable-diffusion-webui"

# PyTorch settings
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --no-half-vae --no-half --opt-sub-quad-attention --use-cpu interrogate"

# python3 executable
#python_cmd="python3"

... file continues unchanged ...

Then all you need to run the web UI is plain ./webui.sh, and everything gets applied automatically.

Replacing this code will allow you to map it, but the ControlNet functionality will not work properly

Thanks, I presume this answer is for AUTOMATIC1111 users, correct? It won't be applicable for those who are building their own customised program using Stable Diffusion from scratch, since all of the dependencies have to be handled by hand. Editing webui.sh is not applicable in that scenario.

Looking forward to hearing from someone who was able to run Stable Diffusion successfully on their Apple silicon machine using MPS (not the CPU) with their own customised program.

luluaidota commented 1 year ago

I ran this on 13.4.1 but I still have the same problem.

thedoger82 commented 1 year ago

For me the problem was the canvas size (1280x720), so I used something smaller (640x320) and got no more MPS problems. In case you need higher resolutions, create your images/videos at a small resolution and then use Topaz, another AI, which will do the job of increasing size and quality.