Sygil-Dev / stable-diffusion

GNU Affero General Public License v3.0

webui.py crashes in Docker on a Windows (WSL) machine #113

Closed ChrisAcrobat closed 2 years ago

ChrisAcrobat commented 2 years ago

πŸ“’ Discussion from #71 continues here.

It crashes somewhere between the log lines sd | LatentDiffusion: Running in eps-prediction mode and sd | DiffusionWrapper has 859.52 M params. I have tried a clean Docker install by deleting all containers, images, and volumes and then running:

docker system prune -a
git clone https://github.com/hlky/stable-diffusion.git
cd stable-diffusion
docker-compose up

But nothing has worked so far; it keeps getting stuck at:

[+] Running 1/0
 - Container sd  Created                                                                                           0.0s
Attaching to sd
sd  |      active environment : ldm
sd  |     active env location : /opt/conda/envs/ldm
sd  | Validating model files...
sd  | checking model.ckpt...
sd  | model.ckpt is valid!
sd  |
sd  | checking GFPGANv1.3.pth...
sd  | GFPGANv1.3.pth is valid!
sd  |
sd  | checking RealESRGAN_x4plus.pth...
sd  | RealESRGAN_x4plus.pth is valid!
sd  |
sd  | checking RealESRGAN_x4plus_anime_6B.pth...
sd  | RealESRGAN_x4plus_anime_6B.pth is valid!
sd  |
sd  | entrypoint.sh: Launching...'
sd  | Loaded GFPGAN
sd  | Loaded RealESRGAN with model RealESRGAN_x4plus
sd  | Loading model from models/ldm/stable-diffusion-v1/model.ckpt
sd  | Global Step: 470000
sd  | LatentDiffusion: Running in eps-prediction mode
sd  | entrypoint.sh: Process is ending. Relaunching in 0.5s...
sd  | /sd/entrypoint.sh: line 89:    29 Killed                  python -u scripts/webui.py
sd  | entrypoint.sh: Launching...'
sd  | Relaunch count: 1

And then it keeps relaunching until I close the program.

Has anyone running Windows got the Docker setup to work?

altryne commented 2 years ago

Can you try running webui.py directly, not through relauncher.py, and post the stack trace you get with the error?
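
For reference, a rough sketch of one way to do that from the host while the compose stack is up (sd is the service name from your log; you may need to activate the ldm conda env inside the container first):

docker-compose exec sd bash    # open a shell in the running container
conda activate ldm             # the env shown as active in your log
python -u scripts/webui.py     # launch directly, bypassing relauncher.py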

ChrisAcrobat commented 2 years ago

Running python -u scripts/webui.py has provided two results for me:

Traceback (most recent call last):
  File "/sd/scripts/webui.py", line 3, in <module>
    from frontend.frontend import draw_gradio_ui
ModuleNotFoundError: No module named 'frontend'

That one may have been a badly timed clone of the repo. ☝️

Loaded GFPGAN
Loaded RealESRGAN with model RealESRGAN_x4plus
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
Killed
altryne commented 2 years ago

Yeah, we split the frontend into its own module a while back, so the first issue is fixed by pulling the latest. The second one doesn't throw any errors? Just Killed? 😩

yourjelly commented 2 years ago

I am getting the same issue in a Docker container running on Linux.

altryne commented 2 years ago

@yourjelly it could be a Gradio issue with ports and such inside Docker.

If you don't mind testing, download webui_playground.py from https://github.com/hlky/stable-diffusion-webui, put it in the same directory, and run python webui_playground.py to see if that crashes as well.

toboshii commented 2 years ago

Getting the same issue on Linux installed using this guide/script.

Playground runs fine for me.

yourjelly commented 2 years ago

Yep, likewise, playground runs fine. Although I can only see it from the public link, not my internal one. That might be a misconfiguration on my end though.

altryne commented 2 years ago

Yep, likewise, playground runs fine. Although I can only see it from the public link, not my internal one. That might be a misconfiguration on my end though.

Yeah, that's probably due to Gradio issues with debug mode.

I'm out of ideas; it's really hard to debug without any error at all 😅

The best I can do is suggest dropping random print("GOT HERE") or sys.exit() calls and seeing whether you reach that point, to try to figure out whether it's possible to load these things inside Docker.
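
For example, something along these lines around the model load in scripts/webui.py (only a rough sketch; the exact call and variable names in your copy of the file may differ):

import sys

print("GOT HERE: about to load the model", flush=True)  # flush=True so it shows up in the docker logs immediately
model = load_model_from_config(config, opt.ckpt)         # the load that appears to get killed
print("GOT HERE: model loaded", flush=True)
sys.exit(0)                                              # bail out early once you know this point is reached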

@hlky any other ideas?

toboshii commented 2 years ago

In my case (and I assume the same for the others as well) it's because it's getting killed by the kernel OOM killer:

[1888018.327103] Out of memory: Killed process 2510239 (python) total-vm:20822252kB, anon-rss:6211836kB, file-rss:127360kB, shmem-rss:10240kB, UID:1000 pgtables:16556kB oom_score_adj:0

It's trying to allocate ~20GB of RAM, and I only have about 6GB available.
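
For anyone else checking this, a quick way to watch the container's memory climb before the kill (sd is the container name from the compose output earlier in the thread):

docker stats sd    # live memory/CPU usage for the sd container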

yourjelly commented 2 years ago

I have observed this too: it eats RAM before being killed. I followed it to webui.py, to model = load_model_from_config(config, opt.ckpt). That then takes it into util.py, where get_obj_from_str(string, reload=False) runs twice before it dies. There might be more it gets through, but that's as close as I've got so far.

I'm gonna stick another 16GB of RAM in my server and see if it gets further.

yourjelly commented 2 years ago

Yep, I got past that point now. It's just trying to eat too much RAM; it hovers around 10GB of usage afterwards.

altryne commented 2 years ago

@yourjelly have you tried running it with the --turbo or --optimized-turbo flags to see if it's better?

altryne commented 2 years ago

cc @toboshii

yourjelly commented 2 years ago

Will do, because my GPU ran out of memory when I tried txt2img:

!!Runtime error (txt2img)!! 
 CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 7.79 GiB total capacity; 5.64 GiB already allocated; 1008.06 MiB free; 5.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
exiting...calling os._exit(0)
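
Side note: the allocator hint in that message can be tried by setting the variable before launch. A rough sketch; 128 MiB is just an example value, not a recommendation:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128    # cap allocation split size to reduce fragmentation
python -u scripts/webui.py
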
yourjelly commented 2 years ago

--optimized-turbo works, using 98% of GPU RAM.

altryne commented 2 years ago

OK, progress! It no longer crashes without any notification, which is a good thing :) I guess we found the issue; now the question is, given --turbo, should we close this issue?

JoshuaKimsey commented 2 years ago

Getting the same issue on Linux installed using this guide/script.

Script creator here, I'm really not sure what to make of this issue. It's not caused by my script itself, but it might be tied to a faulty conda env. Make sure you're using the newest version of my script and that it pulls in the latest updates from the repo: choose no on the previous-parameters question, then yes on the "do you want to update" screen. If that fails, delete the conda env (conda env remove -n lsd), run the script again, select no on the previous parameters, and let it generate a new one.

If it still fails after that, then it is something tied either to your conda installation itself or to a niche bug in the Python code, most likely the latter at that point, since it does at least partially run.

ChrisAcrobat commented 2 years ago
[1888018.327103] Out of memory: Killed process 2510239 (python) total-vm:20822252kB, anon-rss:6211836kB, file-rss:127360kB, shmem-rss:10240kB, UID:1000 pgtables:16556kB oom_score_adj:0

@toboshii I haven't seen that error.

@yourjelly have you tried running it with the --turbo or --optimized-turbo flags to see if it's better?

I guess we found the issue; now the question is, given --turbo, should we close this issue?

Is that a Docker Compose flag? Or could it be passed in when the relauncher restarts ("Relaunch count")? If the problem can't be optimised away, then this workaround should at least be mentioned in the wiki.

I haven't tried the solution yet (or searched about it, just woke up πŸŒ„), but I will check it out as soon as I can!

yourjelly commented 2 years ago

I believe it should be added to the launch line in the entrypoint.sh file for the Docker setup, i.e. python -u scripts/webui.py --optimized-turbo.
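
A rough one-liner for making that edit, assuming the launch line matches the one shown in the logs above (adjust if your copy of entrypoint.sh differs):

sed -i 's|python -u scripts/webui.py|python -u scripts/webui.py --optimized-turbo|' entrypoint.sh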

toboshii commented 2 years ago

Script creator here, I'm really not sure what to make of this issue.

I don't think it's anything wrong with your script; I only mentioned it to make clear that I was running on bare metal and using a "supported" method of installation (I hadn't just hacked stuff together myself :rofl:). This was in a clean Miniconda install; I tried rebuilding the env, same issue.

have you tried running it with the --turbo or --optimized-turbo flags to see if it's better?

--turbo doesn't seem to exist and --optimized-turbo made no difference in my case.

@toboshii I haven't seen that error.

Did you look in the kernel logs? If it's the same issue, you should be able to find it there using dmesg (assuming WSL provides for that; not sure, I haven't used Windows in like 12 years :sweat_smile:).
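
Something along these lines (the exact kernel message wording can vary, and sudo is needed if kernel.dmesg_restrict is enabled):

sudo dmesg | grep -iE "out of memory|oom-kill"    # look for the kernel OOM-killer entry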

All in all, I'm not really sure this is a "bug" or "issue"; I think maybe in my case and others' we just greatly underestimated the amount of memory needed to load the models. Like @yourjelly, I moved to trying it on another machine with 32GB free and had no issues. The original machine I was trying it on (my desktop) has 16GB and generally around 6-8GB free. From what I see on both machines, it needs a minimum of about 26GB free to load the model initially and then idles around 10GB, as @yourjelly mentioned. That seems pretty odd given the model is ~4GB, but honestly this is my first foray into AI stuff outside of Colabs etc., so maybe this is expected.

ChrisAcrobat commented 2 years ago

dmesg in WSL: dmesg: read kernel buffer failed: Operation not permitted πŸ™‚ I'll be monitoring the Windows logs, just to confirm the hypothesis.

hlky commented 2 years ago

@toboshii @altryne meant --optimized or --optimized-turbo

Ideally you need 8GB+ VRAM and 16GB+ RAM.

The --optimized option is designed for 4GB VRAM and --optimized-turbo for 6-8GB VRAM; both will increase RAM usage compared to running without either of those options.
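
For reference, hedged usage examples of the two options (outside Docker; inside the container the same flags go on the webui.py launch line):

python -u scripts/webui.py --optimized           # targets ~4GB VRAM, uses more system RAM
python -u scripts/webui.py --optimized-turbo     # targets 6-8GB VRAM, uses more system RAM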

ChrisAcrobat commented 2 years ago

My laptop has 16 GB of RAM installed and 8 GB of video memory. I do not want to compare directly, because I understand they are different projects with different goals, but with that hardware I have been able to start another Stable Diffusion project (one, the only other one I have tried). EDIT: --optimized-turbo did not solve it for me.

ChrisAcrobat commented 2 years ago

Just read #134. I will try again.

ChrisAcrobat commented 2 years ago

Sadly it still crashed.

sd | saved RealESRGAN_x4plus_anime_6B.pth
sd | entrypoint.sh: Launching...
sd | Loaded GFPGAN
sd | Loaded RealESRGAN with model RealESRGAN_x4plus
sd | Loading model from models/ldm/stable-diffusion-v1/model.ckpt
sd | Global Step: 470000
sd | UNet: Running in eps-prediction mode
sd | entrypoint.sh: Process is ending. Relaunching in 0.5s...
sd | /sd/entrypoint.sh: line 89:   559 Killed                  python -u scripts/webui.py --optimized-turbo
sd | entrypoint.sh: Launching...
sd | Relaunch count: 1
oc013 commented 2 years ago

Below is a good run from scratch on Ubuntu 20.04 with the latest pull. Maybe this can help you debug.

sd  | entrypoint.sh: Launching...
sd  | python -u scripts/webui.py --no-verify-input --optimized-turbo
sd  | Downloading: "https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth" to /opt/conda/envs/ldm/lib/python3.8/site-packages/facexlib/weights/detection_Resnet50_Final.pth
sd  | 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 104M/104M [00:05<00:00, 19.3MB/s] 
sd  | Downloading: "https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth" to /opt/conda/envs/ldm/lib/python3.8/site-packages/facexlib/weights/parsing_parsenet.pth
sd  | 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 81.4M/81.4M [00:04<00:00, 19.6MB/s]
sd  | Loaded GFPGAN
sd  | Loaded RealESRGAN with model RealESRGAN_x4plus
sd  | Loading model from models/ldm/stable-diffusion-v1/model.ckpt
sd  | Global Step: 470000
sd  | UNet: Running in eps-prediction mode
sd  | CondStage: Running in eps-prediction mode
sd  | Downloading: "https://github.com/DagnyT/hardnet/raw/master/pretrained/train_liberty_with_aug/checkpoint_liberty_with_aug.pth" to /root/.cache/torch/hub/checkpoints/checkpoint_liberty_with_aug.pth
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5.10M/5.10M [00:00<00:00, 17.1MB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 939k/939k [00:00<00:00, 7.05MB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 512k/512k [00:00<00:00, 4.68MB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 389/389 [00:00<00:00, 482kB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 905/905 [00:00<00:00, 1.04MB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4.31k/4.31k [00:00<00:00, 5.01MB/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.59G/1.59G [01:32<00:00, 18.4MB/s]
sd  | FirstStage: Running in eps-prediction mode
sd  | making attention of type 'vanilla' with 512 in_channels
sd  | Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
sd  | making attention of type 'vanilla' with 512 in_channels
sd  | Running on local URL:  http://localhost:7860/
sd  | 
sd  | To create a public link, set `share=True` in `launch()`.

Maybe set this up on your system without Docker to determine whether it's actually a Docker issue.

ChrisAcrobat commented 2 years ago

Slight progress. The UI now starts with --optimized, but only with optimizedSD.ddpm.UNet removed. That of course means the UI is not working, but now I know it is only optimizedSD.ddpm.UNet (while using --optimized) that is causing problems for me. Are there other models I can switch it out for? Any tips on how to do so?

ChrisAcrobat commented 2 years ago

I have seen that Docker has an option for disabling the out-of-memory killer (--oom-kill-disable). I'm trying to get it to work and will reply later.
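
For reference, a rough sketch of what that could look like in a v2-style docker-compose.yml (Docker warns that disabling the OOM killer without a memory limit is risky, since the host itself can then run out of memory; the limit below is an example value only):

services:
  sd:
    oom_kill_disable: true
    mem_limit: 16g    # example limit; don't disable the OOM killer without one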