AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Error: Connection errored out. #9074

HaruomiX opened this issue 1 year ago

HaruomiX commented 1 year ago

Is there an existing issue for this?

What happened?

This is my first time making a report on github, so I might miss a few things here and there.

I was using it normally for the first five minutes. After adding a couple of models and switching model for the first time, the page stayed on "refreshing" for about five minutes, so I reloaded it, and now I see errors all over the page: typing in the input box shows an error, and changing models shows an error too. I tried reloading the UI, reinstalling the whole git repo, and deleting the huggingface folder from ./cache; sometimes it works, but after a minute it breaks again.

[screenshot]

I've tried checking the browser console; all I see is one error: Firefox can't establish a connection to the server at ws://127.0.0.1:7860/queue/join.

Steps to reproduce the problem

Unknown

What should have happened?

Should not show error messages and work normally.

Commit where the problem happens

955df775

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

None

List of extensions

Default

Console logs

venv "D:\Workspace\Stable Diffusion WebUI\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Commit hash: 955df7751eef11bb7697e2d77f6b8a6226b21e13
Installing requirements for Web UI
Launching Web UI with arguments:
No module 'xformers'. Proceeding without it.
Loading weights [abcaf14e5a] from D:\Workspace\Stable Diffusion WebUI\stable-diffusion-webui\models\Stable-diffusion\anything-v3-full.safetensors
Creating model from config: D:\Workspace\Stable Diffusion WebUI\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0):
Model loaded in 5.2s (load weights from disk: 0.5s, create model: 0.6s, apply weights to model: 0.9s, apply half(): 0.9s, move model to device: 0.9s, load textual inversion embeddings: 1.3s).
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 15.7s (import torch: 2.7s, import gradio: 2.5s, import ldm: 0.6s, other imports: 2.6s, load scripts: 1.1s, load SD checkpoint: 5.6s, create ui: 0.4s, gradio launch: 0.1s).

Additional information

No response

AndreyDonchev commented 1 year ago

Same here. Happened after the last update. [screenshots]

HaruomiX commented 1 year ago

After trying a bunch of stuff I have found out that adding the code below at the bottom of the style.css file helps a bit, but problems still occur sometimes.

/* hide Gradio's absolutely-positioned error overlays inside the setting_* containers */
[id^="setting_"] > div[style*="position: absolute"] {
    display: none !important;
}

Oxygeniums commented 1 year ago

I have the same error when I try to connect from another device, although everything is fine on the main computer

tangbaiwan commented 1 year ago

Disconnecting from the external network and clearing the proxy settings fixes it.

bjl101501 commented 1 year ago

I have the same error when I try to connect from another device, although everything is fine on the main computer

me too

mik3lang3lo commented 1 year ago

Same problem, did a fresh install, always happens when sending an image to extras and trying to scale it

mik3lang3lo commented 1 year ago

[screenshot]

Happens with any upscaler, all the time, fresh install, any model, any VAE. Any suggestion to try?

RchGrav commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

muzipiao commented 1 year ago

Disconnecting from the external network and clearing the proxy settings fixes it.

Mine is deployed on a server, so I can only access it over the external network.

ProGamerGov commented 1 year ago

I'm running into this issue as well

hashnag commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

Same issue here on Ubuntu 22.04. This fixed it for me (thanks!).

Rayregula commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

Been struggling with the same thing for a few days now. This does fix the UI, but I need the queue to be working, so hopefully we can figure out why it has been breaking everything :(

It has been broken in every commit I've tried since the option was added.

EricChanc commented 1 year ago

[screenshot] How can I solve this problem?

terrificdm commented 1 year ago

Same issue. Is there any way to solve this without the "--no-gradio-queue" flag? Thanks.

Shaiktit commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

Sorry, dumb question: how do I run this command line? Do I insert it in the batch file? If I run it in Python it gives me a syntax error.

terrificdm commented 1 year ago

Just copy and paste export COMMANDLINE_ARGS="--no-gradio-queue" into the CLI tool you use to run the bash webui.sh command, then press Enter. Run it before running bash webui.sh, or put the line in your ~/.bashrc file.
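For reference, a minimal sketch of both approaches on Linux/macOS (this assumes a default install where webui.sh sources webui-user.sh on every launch):

# Option 1: one-off, in the same shell you use to start the UI
export COMMANDLINE_ARGS="--no-gradio-queue"
bash webui.sh

# Option 2: persist it via webui-user.sh in the install directory
# (editing the existing COMMANDLINE_ARGS line there is cleaner than appending)
echo 'export COMMANDLINE_ARGS="--no-gradio-queue"' >> webui-user.sh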

Shaiktit commented 1 year ago

Just copy and paste export COMMANDLINE_ARGS="--no-gradio-queue" into the CLI tool you use to run the bash webui.sh command, then press Enter. Run it before running bash webui.sh, or put the line in your ~/.bashrc file.

Usually I just double-click the batch file on Windows; I don't usually run it from a CLI. Do I run it like this? [screenshot] And then run my Stable Diffusion like this? [screenshot]

issiah-chain commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

This can solve the problem, but another problem emerged: while one task is processing, I cannot put a second one in the task queue. After I click "Generate" to submit a second task, it shows "In queue..." forever.

foxytocin commented 1 year ago

Remote Instance

Local Instance

Rayregula commented 1 year ago

Remote Instance

  • After updating to the newest version (today) I get the same error using the webui on a remote instance with --listen --port 4000 --api --no-half --gradio-auth user:name --api-auth user:name --hide-ui-dir-config --cors-allow-origins=*
  • Requests via the API work fine, even with queuing

Local Instance

  • Works just fine.
  • Tested with and without different flags:

    • --listen --port 4000 works (with and without)
    • --gradio-auth user:name (with and without)
    • --api --api-auth user:name (with and without)

Which branch are you using? The latest commit I see on the master branch is from 2 weeks ago: 22bcc7b

jpenalbae commented 1 year ago

Using master which includes 22bcc7be428c94e9408f589966c2040187245d81 does not fix the problem for me. The only workaround is using --no-gradio-queue

[screenshot]

honunu commented 1 year ago

I am running the ClashX proxy; when I quit that software, the error goes away.

Jackyboy1988 commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

worked for me. thanks!

Doublefire-Chen commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

Thank you, it works on my Ubuntu Server 20.04

Rayregula commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? (except for @foxytocin who has already stated that they had issues either way.)

When gradio queue is enabled and tries to use websockets it attempts to access the login cookie for an https connection and fails to do so as only the one created from http exists.

Apparently a documented gradio issue. I've been trying to fix it for like two weeks. Just wish the people saying to use --no-gradio-queue would have mentioned that was the reason since I need the queue to be working.

Took me like 5 seconds to fix with an ssl cert once I knew that was the problem. I've wasted so much time thinking the queue implementation of the webui was the problem.

Anyway, that was the issue for me and I hope stating it here helps someone else.

bjl101501 commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? […]

Referring to gradio's bug fix, manually modifying the routes file can currently solve the problem, but there are still plugins that report errors here: https://github.com/gradio-app/gradio/pull/3735/files

ghost commented 1 year ago

Preparing dataset...
  0%|                                         | 0/9 [00:00<?, ?it/s]
/Users/me/Downloads/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
/Users/me/Downloads/stable-diffusion-webui/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:736: UserWarning: The operator 'aten::index.Tensor' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)
  pooled_output = last_hidden_state[
100%|███████████████████████████████████████| 9/9 [00:04<00:00, 2.09it/s]
/Users/me/Downloads/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:115: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.")
  0%|                                         | 0/100000 [00:00<?, ?it/s]
/AppleInternal/Library/BuildRoots/9941690d-bcf7-11ed-a645-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolution.mm:1663: failed assertion Only Float32 convolution supported
zsh: abort      ./webui.sh
me@MacBook-Pro stable-diffusion-webui % /usr/local/Cellar/python@3.10/3.10.10_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I keep getting errors about Only Float32 convolution supported. Does anyone know why? This leads to the timed-out message stated above. I tried the command mentioned above, but I get the same issue. Once the dataset is prepared and the first bar reaches 100%, the second bar crashes.

wjbeeson commented 1 year ago

Having this same issue running a queue in a remote instance

wjbeeson commented 1 year ago

@Rayregula Can you elaborate on how you fixed this issue?

jpenalbae commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? (except for @foxytocin who has already stated that they had issues either way.)

When gradio queue is enabled and tries to use websockets it attempts to access the login cookie for an https connection and fails to do so as only the one created from http exists.

Apparently a documented gradio issue. I've been trying to fix it for like two weeks. Just wish the people saying to use --no-gradio-queue would have mentioned that was the reason since I need the queue to be working.

Took me like 5 seconds to fix with an ssl cert once I knew that was the problem. I've wasted so much time thinking the queue implementation of the webui was the problem.

Anyway, that was the issue for me and I hope stating it here helps someone else.

Yes, the issue only happens when using --gradio-auth option.

Rayregula commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? […]

Yes, the issue only happens when using --gradio-auth option.

Nice. Glad it was the same problem.

wjbeeson commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? […]

Yes, the issue only happens when using --gradio-auth option.

Nice. Glad it was the same problem.

I don't see the --gradio-auth option. How do you turn this off?

Rayregula commented 1 year ago

Rayregula Can you elaborate on you fixed this issue?

@wjbeeson Sure.

Quick recap of the issue:

When using --gradio-auth, after a user logs in to the webui a special token (a random number/letter combo) is generated and stored by the browser as a cookie. One cookie is stored for http and another for https, and when you use --gradio-queue (which is now the default), Gradio (which runs the UI) tries to use the cookie saved for https. When you also use the --listen argument to allow connections from the same network, you are connecting over http, so after logging in Gradio can't read the cookie it wants and then thinks you're not connected to the UI.

There are a few possible ways to get around this bug, each with its own downside:

  • Use the --no-gradio-queue flag. Pro: makes the UI look for the http cookie. Con: disables the queue.
  • Don't use the --gradio-auth flag. Pro: it no longer checks whether you were able to log in. Con: a bad idea if you are port forwarding access to the UI from the internet.
  • Use the --share flag instead of, or in addition to, --listen. Pro: Gradio's online link is secured with SSL, so you won't have the problem. Con: may not work with the API; the link resets every 72 hours.
  • Set up an SSL certificate for the webui. Pro: the webui will then be able to use https and will find the right authentication cookie. Con: you must have a domain name (you may be able to get your browser to accept a self-signed certificate, but I believe Chrome no longer supports this).
  • Modify the routes.py file that Gradio uses (see 1981c010), setting it to use the correct cookie. Pro: pretty easy. Con: if Gradio was installed through pip, it could revert your changes if it tries to force an update or reinstall (I don't know if it checks file integrity or not).

For changing routes.py on Linux (see the sketch after this list):

  1. pip show gradio | grep "Location" will tell you where Gradio is installed. (For me that is .local/lib/python3.9/site-packages/gradio/routes.py)
  2. Use your favorite text editor to edit that file and add the changes shown here: 1981c010, or just change "access-token" to "access-token-unsecure".
  3. Restart the UI (if it was running) and you should be good to go.
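A rough sketch of those steps on Linux (the GRADIO_DIR variable name is just for illustration, and the sed line is the blunt version of the edit; review the linked commit and prefer a manual edit if the string appears in more than one place):

# locate the installed gradio package (path varies per environment)
GRADIO_DIR="$(pip show gradio | awk '/^Location:/{print $2}')/gradio"
# keep a backup before touching the file
cp "$GRADIO_DIR/routes.py" "$GRADIO_DIR/routes.py.bak"
# point the queue's token lookup at the cookie that http logins actually set
sed -i 's/"access-token"/"access-token-unsecure"/' "$GRADIO_DIR/routes.py"
# diff against the backup to confirm only the intended line changed
diff "$GRADIO_DIR/routes.py.bak" "$GRADIO_DIR/routes.py"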

Note: Ideally you could just update the version of Gradio being used, but I hear it would break the API as they are not compatible at the moment.

For creating an SSL cert if you have a domain name: you can use Let's Encrypt, or do it through a reverse proxy like nginx proxy manager, which is what I did (I already had one set up, so it was very easy).
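As a rough sketch of the certificate route (sd.example.com is a placeholder domain, and the --tls-certfile/--tls-keyfile flags are assumed to be available in your build of the webui; if they are not, terminate TLS in the reverse proxy instead and forward to 127.0.0.1:7860 with websocket upgrades enabled):

# issue a certificate with Let's Encrypt (the domain must resolve to this host
# and port 80 must be reachable for the challenge)
sudo certbot certonly --standalone -d sd.example.com

# then serve the webui over https so the queue finds the secure login cookie
./webui.sh --listen --gradio-auth user:pass \
  --tls-certfile /etc/letsencrypt/live/sd.example.com/fullchain.pem \
  --tls-keyfile /etc/letsencrypt/live/sd.example.com/privkey.pem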

Rayregula commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? […]

Yes, the issue only happens when using --gradio-auth option.

Nice. Glad it was the same problem.

I don't see the --gradio-auth option. How do you turn this off?

It is off by default unless you pass it when starting the UI; see the list of command line arguments.

jpenalbae commented 1 year ago

Nice catch @Rayregula. I edited the routes.py from gradio as you pointed out and now it's working fine.

I also tried installing gradio 3.27.0, which already includes the patch, and I can confirm it is not compatible: SD refuses to boot with it.

id88viper88id commented 1 year ago

Hello to everyone,

I know that my post is a bit lengthy, but I want to describe my problem in as much detail as I can, so as to save time and avoid discussing things you might think I have not noticed, read about, or tried.

I am new to GitHub as far as having an account here is concerned, and quite new to Stable Diffusion, but I'm a fast learner, and I have learned a lot about Stable Diffusion in just three days.

The reason I'm writing here is that I have an issue making Stable Diffusion train through textual inversion. To start with my specs: I am running a Mac M2 Pro with 12 processor cores, 32 GB of RAM, an 8 TB SSD, and macOS Ventura 13.3.1. I have installed Stable Diffusion locally, at commit 22bcc7be. I have successfully created an embedding with initialization words/prompts, preprocessed the training images of my choice, and provided the directory with the preprocessed images (the dataset directory) as well as the directory for the log file. The problem is that when I try to perform the embedding training, I get the Connection errored out error in the Stable Diffusion web browser interface. The entire error in the Terminal is as follows:

Training at rate of 0.005 until step 100000
Preparing dataset...
100%|█████████████████████████████████████████| 487/487 [02:58<00:00,  2.72it/s]
  0%|                                                | 0/100000 [00:00<?, ?it/s](mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:228:0: error: 'mps.add' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:228:0: note: see current operation: %7 = "mps.add"(%6, %arg2) : (tensor<1x4096x320xf32>, tensor<*xf16>) -> tensor<*xf32>
zsh: segmentation fault  ./webui.sh
hubert@Huberts-Mac-Mini stable-diffusion-webui % /opt/homebrew/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

So, the first error you can see above is error: 'mps.add' op requires the same element type for all operands and results. As far as this error is concerned, I read in a post by @jwoolbridge234 in this thread (I simply look for information and help wherever I can) that using the --no-half flag should solve it; on the other hand, a user @don1138 said in this thread that, as far as he knows, --no-half is currently a default setting. (On a side note, that could be true, because when I tried using the --no-half flag in order to allegedly make image generation run faster, the result was the opposite: generation was even slower than with no flags at all. One other thing that could have slowed image generation down was the --medvram flag, which I had also used and which, as far as I understand, could have cut my 32 GB of VRAM down to 16 GB.)

Then, I also have two types of errors related to zsh. The first one is zsh: segmentation fault bash webui.sh, having to do with tensor<*xf32>; it occurs when I try to train the embedding with all settings left at their defaults: resolution 512x512, learning rate 0.005, no hypernetwork selected, gradient clipping disabled, max steps 100000, saving an image to the log directory every 500 steps, saving a copy of the embedding every 500 steps, "Save images with embedding in PNG chunks" checked, "Read parameters (prompt, etc...) from txt2img tab when making previews" unchecked, "Shuffle tags by ',' when creating prompts" unchecked, "Drop out tags when creating prompts" set to 0, and "Choose latent sampling method" set to "once". With the default resolution of 512x512, the training lasts only several minutes before the errors described occur. Also, only at the 512x512 resolution does the *.json log file get created; it contains the following:

{ "datetime": "2023-04-26 22:19:00", "model_name": "v1-5-pruned-emaonly", "model_hash": "cc6cb27103", "num_of_dataset_images": 487, "num_vectors_per_token": 1, "embedding_name": "Landscapes", "learn_rate": "0.005", "batch_size": 1, "gradient_step": 1, "data_root": "/Users/hubert/Documents/Stable Diffusion Datasets/Images/Landscapes/Source Landscapes", "log_directory": "/Users/hubert/Documents/Stable Diffusion Datasets/Images/Landscapes/Log/2023-04-26/Landscapes", "training_width": 512, "training_height": 512, "steps": 100000, "clip_grad_mode": "disabled", "clip_grad_value": "0.1", "latent_sampling_method": "once", "create_image_every": 500, "save_embedding_every": 500, "save_image_with_stored_embedding": true, "template_file": "/Users/hubert/stable-diffusion-webui/textual_inversion_templates/subject_filewords.txt", "initial_step": 0 }

The second type of the zsh error is:

706: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error:DArray dimension length > INT_MAX
zsh: abort      bash webui.sh

In this case, the only difference compared to the previous settings is that the resolution is set to the maximum available, 2048x2048. The reason for such a high resolution is that the original images I want to train Stable Diffusion on have quite high resolutions, such as 1680x2036, 4032x3024, and others (on a side note, there are 487 of those images). Assuming that letting Stable Diffusion downscale them too much could later result in the generated images having equally low resolutions, and thus worse quality than they could have, I set Stable Diffusion to this maximum of 2048x2048 to prevent this. Stable Diffusion preprocessed them to that resolution with no problem, and as there are images with resolutions greater than 2048x2048, I also checked the "Split oversized images" checkbox, leaving the default value of 0.5 for "Split image threshold" and the default value of 0.2 for "Split image overlap ratio". What accompanies the:

706: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error:DArray dimension length > INT_MAX
zsh: abort      bash webui.sh

error in the Terminal is the Connection errored out error, which pops up in the web browser interface about 3 seconds after clicking the "Train Embedding" button. When this error occurs at this resolution, the log file is not even created. I cannot tell exactly why this error takes place (which is why I'm writing here and asking for your help), but my obvious guess is that it is due to the dimensions/resolution being too high (although since I can set it, having as much as 32 GB of VRAM, I should also be able to train embeddings with images this large). Perhaps the abbreviation INT_MAX means reaching/exceeding some internal maximum, a maximum integer, or a maximum interval of some sort; I have been trying to find information about this error on the Internet, but couldn't, so I'm not sure. I think that, having a 12-core M2 Pro processor and 32 GB of RAM, a Mac Mini M2 Pro should have enough computational power to perform the training with 2048x2048 images, not to mention 512x512 images. Moreover, I also have a very lengthy error report generated by the Mac itself upon the occurrence of this error, but I assume it would be of little to no use in this situation.

As for the solutions provided here so far, they haven't worked for me. For starters, so far I have always started Stable Diffusion locally by changing directory to its main folder and typing ./webui.sh in the Terminal. Seeing the advice by @terrificdm here about running export COMMANDLINE_ARGS="--no-gradio-queue" before bash webui.sh, I noticed that bash webui.sh does indeed start Stable Diffusion as well. Having noticed that, I did exactly as @terrificdm said and ran export COMMANDLINE_ARGS="--no-gradio-queue" before bash webui.sh. However, that didn't change anything, as the error in the Terminal is virtually the same as when running ./webui.sh without export COMMANDLINE_ARGS="--no-gradio-queue" before it:

Training at rate of 0.005 until step 100000
Preparing dataset...
100%|███████████████████████████████████████| 1026/1026 [03:31<00:00,  4.85it/s]
  0%|                                                | 0/100000 [00:00<?, ?it/s](mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:228:0: error: 'mps.add' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:228:0: note: see current operation: %7 = "mps.add"(%6, %arg2) : (tensor<1x4096x320xf32>, tensor<*xf16>) -> tensor<*xf32>
zsh: segmentation fault  bash webui.sh
hubert@Huberts-Mac-Mini stable-diffusion-webui % /opt/homebrew/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Now, judging by the fact that the same zsh: segmentation fault error is printed in the Terminal when training at the 512x512 resolution, regardless of whether I start Stable Diffusion by typing ./webui.sh, by running export COMMANDLINE_ARGS="--no-gradio-queue" first and then bash webui.sh, or by bash webui.sh alone, it seems the problem lies in the training trying to write to, or simply access, some part of RAM at some point of the training process that it should not have access to. As far as I can tell, this is not something I could fix as a regular user who does not mess around with the program code, which, as I infer from this error, needs to be corrected. Has anyone else having issues with the Connection errored out error, be it under the img2img tab or while training under the Training tab, noticed this zsh: segmentation fault error? Have you tried to solve or work around it somehow, or do I have to wait for some future update of Stable Diffusion itself?

Modifying the routes.py file to, as @Rayregula put it, „use the correct cookies”, also did not help. I added:

token = websocket.cookies.get("access-token") or websocket.cookies.get(
                "access-token-unsecure"
            )

right before/above:

return token # token is returned to allow request in queue

and it didn't change anything.

I have never used either the --listen flag or the --share flag, assuming both are meant for connecting to the Stable Diffusion web browser interface remotely (I assume the --listen flag is meant for listening on a particular port), whereas I use it entirely locally.

Just as many of you here have had the Connection errored out error for img2img (which I personally don't, because I can use img2img with no problem), probably caused by some software issue(s), I think it is just the same for training embeddings. What seems to indicate this is this piece of information printed in the Terminal when starting Stable Diffusion:

You are running torch 1.12.1.
The program is tested to work with torch 1.13.1.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Obviously, it says I'm running torch 1.12.1 and not torch 1.13.1, the latter being the version said to cause issues with the Training tab, yet I have those issues anyway (I assume that updating torch from 1.12.1 to 1.13.1 couldn't do any more harm than now, but since the Terminal warns that the later version causes problems with training, I don't feel like updating to it anyway).

At this point, I am clueless as to what I could do to make the training work. As I said before, should I wait for another update of Stable Diffusion?

P.S. @ashoklathwal wrote a reply with a link to an article about resolving the Connection errored out error while I was still writing this post. I took a look at what the article says, and so:

  1. The first solution can be ruled out because I run Stable Diffusion entirely locally, on my Mac Mini, just like many of you run it locally on your computers, I suppose, so it's not that case.

  2. I pretty much assume the default Apple Safari web browser cache is not the issue, because I have an 8 TB SSD with 7.33 TB of free space still left on it.

  3. I could give the Chromium web browser a try, but again, since I have no crashes when generating images even larger than 1024x1024 in Safari, I pretty much assume it is not Safari itself; strangely enough, Safari would then have a problem with training the embedding on 512x512 images while at the same time being able to generate images larger than 1024x1024 with no problem.

  4. I'm not running Stable Diffusion on Google Colab, but entirely locally, so again it's not that case.

  5. In the case of training, there is only the Batch size value, which is already set to 1, and no Batch count value, so this doesn't apply.

  6. I added the "--no-gradio-queue" flag (preceding it with two hyphens, --, and not one, -, as they erroneously did) after set COMMANDLINE_ARGS= (with no space between the = and the flag) in the webui-user.bat file, and it did not help.

Best Regards

Rayregula commented 1 year ago

Check this detailed article if you are getting Stable Diffusion Connection Errored Out

I don't think that link is very useful for this issue; the advice it offers is mostly extremely simple things for one-off issues.

Rayregula commented 1 year ago

@id88viper88id I replied to an earlier comment but haven't ignored you; it will just take me a bit to finish reading your message and formulate a response, as it's massive 😅

Thank you for the detailed message though!

Rayregula commented 1 year ago

@id88viper88id

I am new to GitHub as far as having an account here is concerned, and quite new to Stable Diffusion, but I'm a fast learner, and I have learned a lot about Stable Diffusion in just three days.

Welcome!

The problem is that when I want to perform the embedding training, I am returned with the Connection errored out error in the Stable Diffusion web browser interface. The entire error in the Terminal is as follows:

The issue I and others here were having was that the webui would not communicate with SD (due to the login cookie being set wrong). Since you are getting a terminal error, I expect the Connection Errored Out is caused by the program crashing, resulting in the webui getting disconnected.

So, the first error you can see in the above is error: 'mps.add' op requires the same element type for all operands and results

I don't currently have anything to add regarding this error, but I can answer some of your other questions.

I read in a post by @jwoolbridge234 in this thread (I simply look for information and help wherever I can) that using the --no-half flag should solve this problem

I cannot find any mention of that in this thread?

a user @don1138 said in this thread that as far as he knows, --no-half is currently a default setting

I do not believe this is true as I have to use it to make my hardware work with the webui

(on a side note, that could be true because when I had been trying to use the --no-half flag in order to allegedly make the image generation run faster, the result was the opposite, and the image generation was even slower than when not using any flag at all

If the flag was already a default, specifying it again would not negate the effect

  • one other thing that could have had to do with slowing the image generation down was the --medvram flag, which I had also used, which, as far as I understand, could have cut down my 32 GB VRAM down to 16 GB).

The --medvram flag is most likely the cause of your performance decrease, as it states in the wiki here (command-line-arguments): "enable stable diffusion model optimizations for sacrificing a little speed for low VRM usage"

Then, I also have two types of errors related to zsh. The first one of them is zsh: segmentation fault bash webui.sh, having to do with tensor<*xf32>

I may not be much help with this, but you can try running ulimit in zsh (it shows the memory limit for the terminal) and see if your terminal's memory limit is causing the program to fail. You may need to increase it with stacksize. I do not believe this to be a problem with the webui; it's likely just the environment it is running in.
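A quick sketch of checking and raising those limits in the shell used to launch the webui (values shown are examples, not recommendations):

ulimit -a              # show all per-shell resource limits
ulimit -Hs             # the hard cap the stack-size soft limit can be raised to
ulimit -s unlimited    # raise the stack limit for this shell before starting the UI;
                       # if the OS rejects "unlimited", use the number reported by ulimit -Hs
./webui.sh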

The second type of the zsh error is:

706: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error:DArray dimension length > INT_MAX
zsh: abort      bash webui.sh

In this case, the only difference in comparison to the previous settings is that the resolution is set to the maximum available of 2048x2048. The reason why as high a resolution is that the original images I want to train Stable Diffusion on are of quite high resolutions, such as 1680 × 2036, 4032 × 3024 and other (on a side note, there is 487 of those images). Assuming that letting Stable Diffusion downscale them too much could later result in the generated images to be of equally lesser resolutions, and thus of a worse quality then they could be, I set Stable Diffusion to this maximum of 2048x2048 in order to prevent this.

As you guessed, training with lower-resolution images will result in the model being better at working with images at that lower resolution, though that does not stop you from re-upscaling them afterwards. I am not sure about this, but my guess is that since you are training on such massive images, you will need to do a lot more training, since there are more details in the higher-res images, on top of the large images taking longer to train with anyway.

What accompanies the error in the Terminal, is the Connection errored out error, which pops up in the web browser interface after like 3 seconds or so after clicking the „Train Embedding” button. When the error mentioned above occurs with this resolution, the log file is not even created.

As I mentioned at the top of this comment, I believe you are getting Connection Errored Out because the program stops when you get that error message, causing the webui to be disconnected and the connection to error out. This is further supported by the log file not being generated, I expect because it stopped unexpectedly.

my obvious guess is that it is due to too high dimensions/a too high resolution (although since I can set it due to having as much as 32 GB of VRAM, I should also be able to train embeddings with as large images).

The error message does seem to lead to that. I would try using a lower resolution for the moment. I will take a look at the code and see if the values are set incorrectly.

Perhaps that abbreviation INT_MAX is to mean reaching/exceeding some internal maximum, a maximum integer or a maximum interval of some sort

That is exactly what it means: INT_MAX is the highest value the array dimension can be.

I think that having a 12 core M2 Pro processor and 32 GB of RAM, Mac Mini M2 Pro should have enough computation power to perform the training with 2048x2048 images, not to mention 512x512 images.

Processing-wise I would think so, even if slowly. I do notice you keep switching between saying you have 32 GB of RAM and saying you have 32 GB of VRAM, which are not the same thing (VRAM is much faster and lives on the GPU).

If you have a dedicated GPU, the VRAM is dedicated video memory that only the GPU can use, and it is the main bottleneck in using Stable Diffusion, as it's not cheap. My system has a card with 4 GB of VRAM; if I wanted one with 32 GB of VRAM it would cost me about $10,000 (NVIDIA Tesla V100).

If you are using NVIDIA's CUDA (very fast, but I don't believe Macs can use it anymore after moving away from NVIDIA, though I could be wrong), then you are required to use VRAM and not RAM.

If you are training using the CPU in your Mac then Stable Diffusion will be using your RAM (not VRAM)

Just wanted to clear that up: for example, if you had not stated you were using a Mac but said you had 32 GB of VRAM when you actually had 32 GB of RAM, your GPU might not have supported SD at all with a 2 GB VRAM card, and we would not know.

In your defense, your Mac may just treat them interchangeably, since the integrated GPU within the CPU can use your available RAM. But that is imprecise and just leads to confusion, as you can't be using 32 GB of VRAM and still have 32 GB of RAM when they are the same memory. (Apologies for the rant.)

I also have a very lengthy Error Report generated by the Mac itself upon the occurrence of this error, but I assume it would be of little to no use in this situation.

Unless it's generic (like saying: "an error has occurred") then it could be useful. But we won't know for sure unless we can see it.

I have always started Stable Diffusion locally by changing directory to its main folder first, and then typing ./webui.sh in the Terminal.

This is fine

Seeing the advice by @terrificdm here about running export COMMANDLINE_ARGS="--no-gradio-queue" before bash webui.sh, I have noticed that bash webui.sh indeed starts Stable Diffusion as well.

They do basically the same thing. Using ./<file> tells the terminal to run the file, while bash <file> tells bash (previously the default shell on Mac before they moved to zsh) to run the file, which essentially runs it in a fresh shell; on a Mac, zsh <file> would do the same thing. The example uses bash just because some Linux OSes (for example Debian) have bash as the default shell.

Personally, I like to start the UI directly from Python, for example with python3 webui.py --listen --no-gradio-queue, and pass the variables in directly (I just have a start script that does it for me like that).
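A minimal sketch of such a start script (the install path is a placeholder, and launching webui.py directly skips the dependency checks that launch.py normally performs):

#!/usr/bin/env bash
# activate the venv that webui.sh created earlier, then start the UI with explicit flags
cd ~/stable-diffusion-webui || exit 1
source venv/bin/activate
python3 webui.py --listen --no-gradio-queue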

judging by the fact that the same zsh: segmentation fault error is printed out in the Terminal when doing the training with the 512x512 resolution, printed out so regardless of whether I start Stable Diffusion by typing ./ webui.sh, export COMMANDLINE_ARGS="--no-gradio-queue" first, and then bash webui.sh or bash webui.sh alone

I don't believe they are related so I would not expect a difference.

Modifying the routes.py file to, as @Rayregula put it, „use the correct cookies”, also did not help. I added:

token = websocket.cookies.get("access-token") or websocket.cookies.get(
               "access-token-unsecure"
         )

right before/above: return token # token is returned to allow request in queue , and it didn’t change anything.

You're not having that problem, so it won't affect you.

I have never used either the --listen flag, nor the --share flag with it, assuming both of these are meant for when connecting to the Stable Diffusion web browser interface remotely (I assume the --listen flag is meant for listening on a particular port), whereas I use it entirely locally.

Yes, --listen makes it allow connections from the local network, and --share hosts it on Gradio for others to use.

I could give the Google Chromium web browser a try, but again, since I have no crashes when generating images even larger than 1024x1024 in Apple Safari, I pretty much assume it is not Safari itself, which, strangely enough, would be to have a problem with training the embedding with 512x512 images, while, at the same time, being able to generate images larger than 1024x1024 with no problem.

I don't believe it's a browser issue so you should be good without having to test that.

I added the "--no-gradio-queue" flag (preceding it with two hyphens, --, and not one, -, as they erroneously did) after set COMMANDLINE_ARGS= (with no space between the = and the flag) in the webui-user.bat file, and it did not help.

The webui supports multiple operating systems, and each one needs to be started a little differently, so there are three ways to start it:

  • webui.sh (Linux/macOS)
  • webui.bat (Windows)
  • webui.py (universal Python file, but you must pass your arguments manually, as it bypasses the automatic setup)

When starting either webui.sh or webui.bat, it grabs the configuration from webui-user.sh or webui-user.bat (depending on OS) and runs launch.py, which loads the options into arguments and starts webui.py.

The file you modified was the Windows configuration file, which has no effect if you are running the webui on macOS. It was a pretty low-quality help article that only considered people using Windows, though I don't expect it would have helped anyway.
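So on macOS the place to put the flag is webui-user.sh in the install folder; a sketch of the relevant line, assuming the stock file layout:

# webui-user.sh is sourced by webui.sh on Linux/macOS; webui-user.bat is ignored there
export COMMANDLINE_ARGS="--no-gradio-queue"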

TLDR:

I believe your main problem is the segmentation fault, and the array being too large may be an actual bug.

My advice:

id88viper88id commented 1 year ago

@Rayregula

Welcome!

Thank you for the welcoming :)

@id88viper88id

The problem is that when I want to perform the embedding training, I am returned with the Connection errored out error in the Stable Diffusion web browser interface. The entire error in the Terminal is as follows:

@Rayregula

Since you are getting a terminal error, I expect the Connection Errored Out is caused by the program crashing, resulting in the webui getting disconnected.

If it is the program crashing, then I can't do much, if anything, from my standpoint really.

@Rayregula

I cannot find any mention of that in this thread?

I was pointing to other threads (using the syntax for embedding web links), and I don't know why, and am surprised myself, that the links to those threads are gone from my post… As for the post about --no-half allegedly solving the problem, it can be found in this thread (this time I won't use the embedding, and perhaps the link will not disappear): https://github.com/Mikubill/sd-webui-controlnet/issues/68, by @jwoolbridge234. As for --no-half already being a default setting, it is mentioned in this thread: https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/7453, by @don1138. To make it easier to find by searching on the page, in his post he specifically says:

My understanding is the --no-half is now set by default, so we don't need to include it. IDK how this affects my models, since I'm exclusively using fp16 versions of everything, but I'll follow best practices until I know more.

@id88viper88id

(on a side note, that could be true because when I had been trying to use the --no-half flag in order to allegedly make the image generation run faster, the result was the opposite, and the image generation was even slower than when not using any flag at all

@Rayregula

If the flag was already a default, specifying it again would not negate the effect

That's good to know, although seeing that generation works faster without this flag anyway, I guess I won't be using it anymore.

@Rayregula

The --medvram flag is most likely the cause of your performance decrease, as it states in the wiki here (command-line-arguments): "enable stable diffusion model optimizations for sacrificing a little speed for low VRM usage"

Thank you very much for pointing me to this wiki article, as it clears up a lot. Looking at the table there, it does seem that --no-half is not a default flag after all (in that case, @don1138 must have been wrong). As a little side note (again), following other people's advice on the Internet, I had also been using --precision full along with --medvram, and I'm not sure how much setting --precision to full can really accelerate image generation, but my guess now is that it could have been --medvram that compromised the effect of --precision full (I'm not confident about that, just guessing). Anyhow, since --medvram slows down image generation, just as --no-half seems to slow it down, I will refrain from using both of them.

@id88viper88id

Then, I also have two types of errors related to zsh. The first one of them is zsh: segmentation fault bash webui.sh, having to do with tensor<*xf32>

@Rayregula

I may not be much help with this, but you can try running ulimit in zsh (it shows the memory limit for the terminal) and see if your terminal's memory limit is causing the program to fail. You may need to increase it with stacksize.
I do not believe this to be a problem with the webui; it's likely just the environment it is running in.

I ran ulimit in the Terminal, and it said: unlimited. If the memory is indeed not limited, then it shouldn't be the problem, although I'm guessing that even if it is unlimited, setting some specific value could still work around a problem that occurs anyway.

@id88viper88id

The second type of the zsh error is:

@Rayregula

As you guessed training with lower resolution images will result in the model being better at working with images at that lower resolution, though that does not stop you from re-upscaling them afterwards.
I am not sure on this, but my guess is that since you are training such massive images that you will need to do a lot more training since there are more details in the higher res images. On top of the large images taking longer to train with anyway.

I wouldn't mind the extra time necessary to train at this greater resolution. I realize it would take more time, but presumably not so much that it would be days; even if it were 12 hours or so (also taking into account the number of images being trained on), that would be bearable. The problem, however, is that the training crashes (the Connection errored out error occurs) almost instantly, so it cannot be performed at all. The question is how, if at all, this problem can be solved.

@id88viper88id

my obvious guess is that it is due to too high dimensions/a too high resolution (although since I can set it due to having as much as 32 GB of VRAM, I should also be able to train embeddings with as large images).

@Rayregula

The error message does seem to lead to that. I would try using a lower resolution for the moment. I will take a look at the code and see if the values are set incorrectly.

It was my intuitive guess to try the training at a lower resolution, and so I did. However, as I said in my initial post (I realize it could have got lost in all the text), the training crashes at the 512x512 resolution as well. It does last longer, with Stable Diffusion saying it is preparing the dataset, but eventually it crashes at some point of preparing that dataset too. It is from training at 512x512 that I get the log file generated (in contrast to 2048x2048, where it does not get generated at all). Since others with GPUs with less VRAM can, as I understand, train embeddings at 512x512, it should not be too much for my Mac Mini M2 Pro, which has 32 GB (as you correctly supposed in your post, the GPU in the Mac Mini is integrated with the CPU and is not a separate GPU). Besides, I checked the GPU usage in Mac's Activity Monitor while generating a 512x512 image being upscaled to 2048x2048 during generation, and it was using even a bit over 31 GB in the process (eventually it crashed, though, and the image that got generated in spite of that crash was 1024x1024, but that's a different story). I will only mention briefly that the errors generated at that time were many instances of:

Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x2d3fc9d20>
    label = <none> 
    device = <AGXG14SDevice: 0x150907c00>
        name = Apple M2 Pro 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x297823400>
        label = <none> 
        device = <AGXG14SDevice: 0x150907c00>
            name = Apple M2 Pro 
    retainedReferences = 1

and finally:

RuntimeError: Invalid buffer size: 16.00 GB

The crash of the training at the 512x512 resolution ends with the zsh: segmentation fault error, which I also mentioned in my initial post, so the question is what to do about that. Since, as I mentioned, others with less VRAM than me can train at the 512x512 resolution, I don't think it is the 512x512 resolution itself that is too much and causes the issue, or that I would have to train Stable Diffusion at an even lower resolution than 512x512. Perhaps it is the number of images the training is performed with, but my guess is that in order to get superb generation results later you could train Stable Diffusion with even thousands of images, and 487 images doesn't seem to be so many that 32 GB of VRAM couldn't handle it without crashing.

@Rayregula

If you have a dedicated GPU the VRAM is dedicated video memory that only the GPU can use and is the main bottleneck in using Stable Diffusion as it's not cheap. My system has a card with 4GB of VRAM, if I wanted one with 32GB of VRAM it would cost me about $10,000 NVIDIA Tesla V100 If you are using NVIDIA's CUDA (very fast but I don't believe Mac's can use it anymore having moved away from NVIDIA, but I could be wrong) then you are required to use the VRAM and not RAM. If you are training using the CPU in your Mac then Stable Diffusion will be using your RAM (not VRAM) Just wanted to clear that up, as for example if you had not stated you were using a Mac but said you had 32GB of VRAM when you actually had 32GB of RAM then your GPU may not have supported SD at all with a 2GB VRAM card but we would not know. In your defense your Mac may just refer them interchangeably if the Integrated GPU within the CPU can use your available RAM
 But that is incorrect and just results in problems as you can't be using 32GB of VRAM and still have 32GB of RAM as they are using the same memory. (apologies for the rant)

Yes, the GPU is integrated with the CPU, so when Stable Diffusion is using whatever amount of the 32 GB of RAM available, it uses it as VRAM for its purposes (contemporary Macs, and the M2 Macs in particular, are very efficient at utilizing RAM).

@id88viper88id

I also have a very lengthy Error Report generated by the Mac itself upon the occurrence of this error, but I assume it would be of little to no use in this situation.

@Rayregula Apple Error Report.pdf

Unless it's generic (like saying: "an error has occurred") then it could be useful. But we won't know for sure unless we can see it.

The Error Report in question is very lengthy and detailed. I've attached it to this post so that everyone can take a look at it.

@Rayregula

Does basically the same thing. Using ./<file> tells the terminal to run the file and bash <file> just tells it to have bash (previously the default terminal on Mac before they moved to ZSH) open the file which essentially opens it in a fresh terminal. on mac doing zsh <file> would do the same thing. The example of bash is just because some linux OS's (for example debian) have bash as the default terminal.

For me I like to start the ui directly from python for example with python3 webui.py --listen --no-gradio-queue and pass the variables in directly (just have a start script that does it for me like that).

The webui supports multiple operating systems, and each one needs to started a little differently and as such has 3 methods to start it: • webui.sh Linux/MacOS • webui.bat Windows • webui.py (universal python file) (but must pass your arguments manually as it bypasses the automatic setup) When starting either webui.sh or webui.bat they grab the configuration from webui-user.sh or webui-user.bat (depending on OS), run launch.py which loads the options into arguments and starts webui.py The file you modified was the windows configuration file which will have no effect if you are running the webui on MacOS
It was a pretty low quality help article and only considered people using windows.
Though I don't expect it to help anyway.

Thank you for all the insightful information.

@Rayregula

If that does not seem to be the issue, make this a new issue, as I believe it to be different. Fill out the issue form properly (seeing this post, I believe you will), giving us the current command line arguments you are using, as they do change the behavior, as well as a more appropriate title so people having a similar issue can find it (something like: "Training large resolution embedding "Error:DArray dimension length > INT_MAX"").

I can report this as a new issue tomorrow.

Best Regards

Apple Error Report.pdf

SemiZhang commented 1 year ago

Can I get confirmation if any of these issues are only happening on http connections that are using --gradio-auth? (except for @foxytocin who has already stated that they had issues either way.)

When gradio queue is enabled and tries to use websockets it attempts to access the login cookie for an https connection and fails to do so as only the one created from http exists.

Apparently a documented gradio issue. I've been trying to fix it for like two weeks. Just wish the people saying to use --no-gradio-queue would have mentioned that was the reason since I need the queue to be working.

Took me like 5 seconds to fix with an ssl cert once I knew that was the problem. I've wasted so much time thinking the queue implementation of the webui was the problem.

Anyway, that was the issue for me and I hope stating it here helps someone else.

Occurred with an https connection, NOT using "--gradio-auth". [screenshot]

CLI arguments in use: "--use-cpu all --no-half --skip-torch-cuda-test" [screenshot]

Adding "--no-gradio-queue" does solve the problem, but the generated picture disappears from the webui when progress reaches 100%.

id88viper88id commented 1 year ago

Update:

Two days ago I finally managed to perform the embedding training. Among the things I did that resulted in the training finally working were: updating Stable Diffusion from commit 22bcc7be to commit 5ab7f213; updating Torch from version 1.12.1 to version 2.0.0 (the --reinstall-torch flag added after export COMMANDLINE_ARGS= in the webui-user.sh file did not work; instead I had to delete the venv folder inside the stable-diffusion-webui main folder and then start the Stable Diffusion server as normal, during which startup the latest 2.0.0 version of Torch was automatically downloaded and installed); and installing, through Homebrew, all the libraries indicated as missing for PyTorch: ninja, eigen, libuv, numpy, openblas, pybind11, python-typing-extensions, pyyaml, and libomp (although I am not sure whether installing those libraries actually had an impact on making the embedding training work).

I trained my embedding with 487 images, which I considered not that many for later generating good-quality, realistic-looking images, although far more than the 10 images commonly suggested as being enough. Unfortunately, the resulting generated images are far from realistic. Instead, as you can see in the screenshots attached, they are brightly and vividly colored, looking somewhat cartoonish, although the unsharp blending of the colors, which are not uniform or clearly bordered, makes them not even resemble typical cartoon drawings; the images are somewhere in between photos and cartoon drawings, being neither. I'm not sure whether this is due to the latest 2.0.0 version of Torch (which users are warned causes issues with training, although running version 1.12.1 I couldn't perform any training at all), the 5ab7f213 commit of Stable Diffusion itself, or something else Stable Diffusion needs to operate. Is this a known issue, and should I simply wait for future updates of Stable Diffusion?
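For anyone else in the same situation, a rough sketch of the forced Torch reinstall described above (the install path is a placeholder; deleting venv removes all installed Python packages, which are re-downloaded on the next start):

cd ~/stable-diffusion-webui
rm -rf venv        # drop the old virtualenv so the launcher rebuilds it
./webui.sh         # recreates venv and reinstalls the pinned dependencies, including torch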

001

002

003
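
For reference, a rough sketch of the sequence described above, assuming a default install where the virtual environment lives in a venv folder inside stable-diffusion-webui; the Homebrew package list is the one from the comment and may well be unnecessary on other machines.

cd stable-diffusion-webui
# Update the checkout (22bcc7be -> 5ab7f213 in the report above).
git pull

# Delete the old virtual environment so the launcher rebuilds it and installs
# the current torch on the next start (--reinstall-torch reportedly did not
# take effect here).
rm -rf venv

# Optional, macOS only: packages the commenter installed via Homebrew; it is
# unclear whether they were actually needed for training to work.
brew install ninja eigen libuv numpy openblas pybind11 python-typing-extensions pyyaml libomp

# Relaunch; the venv and torch 2.0.0 are set up automatically during startup.
bash webui.sh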

GoodLandxxx commented 1 year ago

Try setting the arg "--no-gradio-queue".

id88viper88id commented 1 year ago

@GoodLandxxx

Try setting the arg "--no-gradio-queue".

I am not certain whether you were replying to @CoqueTornado's mention of this discussion in another thread, or to my post before it about the training result images being too colorful and not realistic-looking. Either way, I had already tried the --no-gradio-queue argument/flag (back on the previous version of Stable Diffusion and the previous version of Torch, when training would crash at the preparation stage without performing any actual training at all), both by running it directly in the Mac Terminal before running bash webui.sh and by adding it to the line with the arguments/flags in the webui-user.sh file, and it did not help with that previous issue. I could try adding it again, either on the arguments/flags line in the webui-user.sh file or in the .zshrc file, but I doubt it would solve the issue of the training result images being as colorful as shown in the screenshots in my previous post. Besides, training is a lengthy process (even on an Apple Mac Mini M2 Pro), so I could only report back on the results of using the --no-gradio-queue argument/flag some time later.

P.S. I can see that the latest version of Stable Diffusion listed under Releases as of now is 1.2.1, so I will also update my installation to that version and check how training works there with the --no-gradio-queue argument/flag.
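
For anyone who still wants to try the flag that way, a minimal sketch of the webui-user.sh edit the comments above refer to, assuming a default Linux/macOS install (on Windows the equivalent line goes into webui-user.bat with set instead of export, as mentioned later in this thread):

# webui-user.sh -- flags placed here are applied on every launch.
# Keep any flags you already use on the same line.
export COMMANDLINE_ARGS="--no-gradio-queue"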

congaterori commented 1 year ago

It said "launch.py: error: unrecognized arguments: --no-gradio-queue". I run AUTOMATIC1111 (Anything v4.5) on Google Colab, and recently I can't run it at all.

Failed to build pycairo

stderr:   error: subprocess-exited-with-error

  × Building wheel for pycairo (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pycairo
ERROR: Could not build wheels for pycairo, which is required to install pyproject.toml-based projects

Warning: Failed to install svglib, some preprocessors may not work.
Installing sd-webui-controlnet requirement: fvcore

Launching Web UI with arguments: --share --disable-safe-unpickle --no-half-vae --xformers --enable-insecure-extension --no-gradio-queue --theme dark --remotemoe
usage: launch.py
       [-h]
       [--data-dir DATA_DIR]
       [--config CONFIG]
       [--ckpt CKPT]
       [--ckpt-dir CKPT_DIR]
       [--vae-dir VAE_DIR]
       [--gfpgan-dir GFPGAN_DIR]
       [--gfpgan-model GFPGAN_MODEL]
       [--no-half]
       [--no-half-vae]
       [--no-progressbar-hiding]
       [--max-batch-count MAX_BATCH_COUNT]
       [--embeddings-dir EMBEDDINGS_DIR]
       [--textual-inversion-templates-dir TEXTUAL_INVERSION_TEMPLATES_DIR]
       [--hypernetwork-dir HYPERNETWORK_DIR]
       [--localizations-dir LOCALIZATIONS_DIR]
       [--allow-code]
       [--medvram]
       [--lowvram]
       [--lowram]
       [--always-batch-cond-uncond]
       [--unload-gfpgan]
       [--precision {full,autocast}]
       [--upcast-sampling]
       [--share]
       [--ngrok NGROK]
       [--ngrok-region NGROK_REGION]
       [--enable-insecure-extension-access]
       [--codeformer-models-path CODEFORMER_MODELS_PATH]
       [--gfpgan-models-path GFPGAN_MODELS_PATH]
       [--esrgan-models-path ESRGAN_MODELS_PATH]
       [--bsrgan-models-path BSRGAN_MODELS_PATH]
       [--realesrgan-models-path REALESRGAN_MODELS_PATH]
       [--clip-models-path CLIP_MODELS_PATH]
       [--xformers]
       [--force-enable-xformers]
       [--xformers-flash-attention]
       [--deepdanbooru]
       [--opt-split-attention]
       [--opt-sub-quad-attention]
       [--sub-quad-q-chunk-size SUB_QUAD_Q_CHUNK_SIZE]
       [--sub-quad-kv-chunk-size SUB_QUAD_KV_CHUNK_SIZE]
       [--sub-quad-chunk-threshold SUB_QUAD_CHUNK_THRESHOLD]
       [--opt-split-attention-invokeai]
       [--opt-split-attention-v1]
       [--disable-opt-split-attention]
       [--disable-nan-check]
       [--use-cpu USE_CPU [USE_CPU ...]]
       [--listen]
       [--port PORT]
       [--show-negative-prompt]
       [--ui-config-file UI_CONFIG_FILE]
       [--hide-ui-dir-config]
       [--freeze-settings]
       [--ui-settings-file UI_SETTINGS_FILE]
       [--gradio-debug]
       [--gradio-auth GRADIO_AUTH]
       [--gradio-auth-path GRADIO_AUTH_PATH]
       [--gradio-img2img-tool GRADIO_IMG2IMG_TOOL]
       [--gradio-inpaint-tool GRADIO_INPAINT_TOOL]
       [--opt-channelslast]
       [--styles-file STYLES_FILE]
       [--autolaunch]
       [--theme THEME]
       [--use-textbox-seed]
       [--disable-console-progressbars]
       [--enable-console-prompts]
       [--vae-path VAE_PATH]
       [--disable-safe-unpickle]
       [--api]
       [--api-auth API_AUTH]
       [--api-log]
       [--nowebui]
       [--ui-debug-mode]
       [--device-id DEVICE_ID]
       [--administrator]
       [--cors-allow-origins CORS_ALLOW_ORIGINS]
       [--cors-allow-origins-regex CORS_ALLOW_ORIGINS_REGEX]
       [--tls-keyfile TLS_KEYFILE]
       [--tls-certfile TLS_CERTFILE]
       [--server-name SERVER_NAME]
       [--gradio-queue]
       [--skip-version-check]
       [--no-hashing]
       [--no-download-sd-model]
       [--controlnet-dir CONTROLNET_DIR]
       [--controlnet-annotator-models-path CONTROLNET_ANNOTATOR_MODELS_PATH]
       [--no-half-controlnet]
       [--controlnet-preprocessor-cache-size CONTROLNET_PREPROCESSOR_CACHE_SIZE]
       [--cloudflared]
       [--localhostrun]
       [--remotemoe]
       [--ldsr-models-path LDSR_MODELS_PATH]
       [--lora-dir LORA_DIR]
       [--scunet-models-path SCUNET_MODELS_PATH]
       [--swinir-models-path SWINIR_MODELS_PATH]
launch.py: error: unrecognized arguments: --no-gradio-queue
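
The "unrecognized arguments" error above suggests the webui build in that Colab notebook predates the --no-gradio-queue flag (note that the usage list only shows --gradio-queue). A quick, hedged way to check whether the installed copy even defines the flag before adding it, assuming a standard checkout layout:

cd stable-diffusion-webui
# If this prints nothing, the installed version predates --no-gradio-queue and
# the flag has to be dropped from the launch arguments (or the webui updated).
grep -rn "no-gradio-queue" modules/ launch.py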
GounGG commented 1 year ago

Similarly, in version 1.2.1, --no-gradio-queue cannot be used. The worst impact is that when running the UI in Kubernetes, the WebSocket connection is very unstable and often fails. I guess it is because of the proxy.

│                                                                              │
│   418 │   if cmd_opts.nowebui:                                               │
│   419 │   │   api_only()                                                     │
│   420 │   else:                                                              │
│ ❱ 421 │   │   webui()                                                        │
│   422                                                                        │
│                                                                              │
│ /content/stable-diffusion-webui/webui.py:319 in webui                        │
│                                                                              │
│   316 │   │   │   FastAPI.original_setup = FastAPI.setup                     │
│   317 │   │   │   FastAPI.setup = fastapi_setup                              │
│   318 │   │                                                                  │
│ ❱ 319 │   │   app, local_url, share_url = shared.demo.launch(                │
│   320 │   │   │   share=cmd_opts.share,                                      │
│   321 │   │   │   server_name=server_name,                                   │
│   322 │   │   │   server_port=cmd_opts.port,                                 │
│                                                                              │
│ /home/user/.local/lib/python3.10/site-packages/gradio/blocks.py:1717 in      │
│ launch                                                                       │
│                                                                              │
│   1714 │   │   if not isinstance(self.blocked_paths, list):                  │
│   1715 │   │   │   raise ValueError("`blocked_paths` must be a list of direc │
│   1716 │   │                                                                 │
│ ❱ 1717 │   │   self.validate_queue_settings()                                │
│   1718 │   │                                                                 │
│   1719 │   │   self.config = self.get_config_file()                          │
│   1720 │   │   self.max_threads = max(                                       │
│                                                                              │
│ /home/user/.local/lib/python3.10/site-packages/gradio/blocks.py:1556 in      │
│ validate_queue_settings                                                      │
│                                                                              │
│   1553 │                                                                     │
│   1554 │   def validate_queue_settings(self):                                │
│   1555 │   │   if not self.enable_queue and self.progress_tracking:          │
│ ❱ 1556 │   │   │   raise ValueError("Progress tracking requires queuing to b │
│   1557 │   │                                                                 │
│   1558 │   │   for fn_index, dep in enumerate(self.dependencies):            │
│   1559 │   │   │   if not self.enable_queue and self.queue_enabled_for_fn(fn │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Progress tracking requires queuing to be enabled.

Start flags:

python webui.py --xformers --listen --enable-insecure-extension-access --no-progressbar-hiding  --no-gradio-queue

Now I have to use an old version because it depends on a lower version of gradio as well.
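
If rolling back really is the only option, a minimal sketch of pinning the checkout to an older revision is below; the hash is only an example taken from a commit mentioned earlier in this thread, and which revision actually ships a gradio version that behaves correctly is an assumption that may take some trial and error.

cd stable-diffusion-webui
# Example only: 22bcc7be is an older commit referenced earlier in this thread;
# substitute whatever commit or release tag worked for you.
git checkout 22bcc7be
# Remove the venv so the launcher reinstalls the requirements (including the
# older gradio) that match this revision on the next start.
rm -rf venv
bash webui.sh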

id88viper88id commented 1 year ago

In my case (Stable Diffusion run locally on my Mac Mini), using --no-gradio-queue made no difference whatsoever, so I simply removed it from the export COMMANDLINE_ARGS= line in the webui-user.sh file.

As for the training test images, and the images generated after training had finished, being excessively vividly colored and their subjects visually disfigured: I found out that this was due to overtraining the model with too many steps. The default value for steps under the Train tab is 100K (i.e. one hundred thousand), which turns out to be far too much; as a result, Stable Diffusion deviates too much from the dataset images, adds too much noise, and makes the resulting images look grotesque rather than like realistic photos. When I decreased the number of steps to 2000 (some websites recommend experimenting around that value), many of the resulting images began to look much more like realistic photos - some genuinely did, while others were not quite there but close. So the default value of 100K steps is far too high if you want to generate images that look like realistic photos, or at least images that are not grotesque due to overcoloring and unnatural visual disfiguration of their subjects.

NureddinFarzaliyev commented 1 year ago

I'm using Stable Diffusion in Google Colab, with my files saved on Google Drive, and I'm facing the same issue.

First, I downloaded the user-webui.bat file, added the set COMMANDLINE_ARGS="--no-gradio-queue" line, and replaced the file on Google Drive with the edited one. Then I realised that when I run the Install/Update AUTOMATIC1111 repo step, it reverts the file to its original state. So I tried running the Install/Update AUTOMATIC1111 repo step first and then changing the bat file, but the issue keeps occurring.
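
One hedged workaround for the update step overwriting the file is to re-apply the edit from a notebook cell after the Install/Update step has run. The Drive mount point and the path to user-webui.bat below are assumptions and need to be adjusted to whatever the notebook actually uses.

# Assumed layout: Google Drive mounted at /content/drive and the notebook's
# copy of user-webui.bat kept under MyDrive/sd/ -- adjust both to your setup.
CONFIG="/content/drive/MyDrive/sd/user-webui.bat"

# Re-add the flag only if it is missing, so the cell is safe to rerun after
# every update. If the file already has a COMMANDLINE_ARGS line, edit that
# line instead of appending a second one.
grep -q "no-gradio-queue" "$CONFIG" || \
  echo 'set COMMANDLINE_ARGS="--no-gradio-queue"' >> "$CONFIG"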

NureddinFarzaliyev commented 1 year ago

Now I have to use an old version because it depends on a lower version of gradio as well.

@GounGG Which version can be used to avoid the connection errored out issue? Is it possible to use older versions in Google Colab?

Moryanmeena commented 1 year ago

export COMMANDLINE_ARGS="--no-gradio-queue"

I am also having the same issue, but I use Stable Diffusion on Google Colab. Can someone help me with it?