AbdBarho / stable-diffusion-webui-docker

Easy Docker setup for Stable Diffusion with user-friendly UI

WebUI won't recover when running out of VRAM #51

Closed jasalt closed 1 year ago

jasalt commented 2 years ago

Not sure where to ask or whether it's a known issue, but the web UIs don't seem to recover after increasing the resolution enough to run out of VRAM. Currently I have to interrupt Docker with Ctrl+C and restart it to recover from the out-of-memory situation, which ends with the RuntimeError: CUDA error: an illegal memory access was encountered error. Same behavior with the earlier default UI a couple of days ago and with the 'auto' UI profile of the latest 1.0.0 release on WSL2/Docker Desktop/RTX 3060.

Good job with this project, it's a very convenient way to start experimenting with SD web UIs. Thank you.

AbdBarho commented 2 years ago

@jasalt could you please define what you mean by "recover"? Do you mean that the container restarts, or that the app continues to function normally?

jasalt commented 2 years ago

To "recover", meaning the app to continue function normally.

AbdBarho commented 2 years ago

@jasalt hmmmm, I don't think I can do anything on the container side. If the container stops on error, you can add restart: on-failure to the docker-compose.yml to restart it. However, it seems that Gradio catches the error but does not recover from it.

In any case, if you just want to restart, you can try docker compose --profile auto restart, same effect as stopping and restarting, with less typing.

jasalt commented 2 years ago

Ok, thanks. I will experiment with it. The auto profile has been very stable as long as the resolution stays at 640x640 or below with 12GB VRAM. Hard to share access to it with others, however, before there's some solution.

AbdBarho commented 2 years ago

@jasalt You might want to check your config; I can generate a 704 x 704 image on 6GB of VRAM. You should be able to go up to 1024 with 12GB?

jasalt commented 2 years ago

I was getting 512x512 with all optimizations off, that is (12GB VRAM). Tested with the defaults of the hlky profile now and got up to 832 x 960. I didn't inspect the difference in render quality much, but at least it isn't clearly visible.

Adding restart: on-failure to the profile didn't restart it after running out of VRAM, but restart: always does. This resets the Gradio public share URL, but that shouldn't be a problem with a proper reverse proxy setup. Example config change for the hlky profile which restarts after the error:

  hlky:
    <<: *base_service
    profiles: ["hlky"]
    restart: always
    build: ./services/hlky/
    environment:
      - CLI_ARGS=--optimized-turbo

AbdBarho commented 2 years ago

@jasalt the auto profile has more optimizations, so maybe you can generate larger images with it.

For the restart, you would probably need to create an issue in the respective UI repository so the errors are handled gracefully; then restart: would not be necessary anymore.

jasalt commented 2 years ago

Ok, I will test that too. I liked the auto profile's live preview a lot.

Thanks for the workaround. I'll keep an eye on the upstream repositories.

jasalt commented 2 years ago

Seems like restart: always only works with the hlky profile, which exits with code 0 when running out of VRAM (this also explains why restart: on-failure never triggered, since that policy only reacts to non-zero exit codes):

webui-docker-hlky-1  | !!Runtime error (txt2img)!!
webui-docker-hlky-1  |  CUDA error: an illegal memory access was encountered
webui-docker-hlky-1  | CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
webui-docker-hlky-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
webui-docker-hlky-1  | exiting...calling os._exit(0)
webui-docker-hlky-1 exited with code 0

After this it restarts. The auto profile handles the out-of-VRAM error differently: it does not exit and restart but hangs at the Docker Compose prompt like so:

...
webui-docker-automatic1111-1  |   File "/stable-diffusion-webui/modules/sd_samplers.py", line 43, in sample_to_image
webui-docker-automatic1111-1  |     x_sample = 255. * np.moveaxis(x_sample.cpu().numpy(), 0, 2)
webui-docker-automatic1111-1  | RuntimeError: CUDA error: an illegal memory access was encountered
webui-docker-automatic1111-1  | CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
webui-docker-automatic1111-1  | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
webui-docker-automatic1111-1  |
AbdBarho commented 2 years ago

Yeah, the hlky fork handles the errors explicitly and exits gracefully here. The auto fork just leaves the error to Gradio, which probably does nothing and leaves the app in an invalid state.

jasalt commented 2 years ago

Yeah, while it's getting a bit complex with differences at that level, I put together an ugly workaround for running the auto profile with "self recovery" when running out of memory: simply tailing the logs and running the restart command with awk.

After starting up the auto profile normally, this would be run in another terminal (in WSL):

docker compose --profile auto logs --follow --tail 5 | awk '/illegal memory access was encountered/ {system("docker compose --profile auto restart")}'

After a crash it should kick Docker Compose back up again. The original terminal will exit, but logs can be watched there again with docker compose --profile auto logs --follow. A bit of a glitchy experience, but it seems to work for now.

AbdBarho commented 2 years ago

Wizard! This still has the cost of reloading the entire app / models from scratch, which takes roughly 20 seconds on my machine, but I think it is still better than nothing.

One thing you could probably do, if you really want to hack away at it, is to have some code run as part of the build that adds a try/catch to the code responsible for all GPU calls.

I already have something similar for adding a link to this repo, maybe you can try the same approach.
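
Not tested, just a rough sketch of that try/catch idea in Python: generate_images is a hypothetical placeholder for whatever entry point the UI actually calls, and clearing the CUDA cache may not be enough to recover from an illegal memory access.

import functools
import torch

def recover_from_cuda_errors(fn):
    """Wrap fn so a CUDA RuntimeError is caught, the cache is cleared and a
    plain error is re-raised for the UI to display, instead of the app being
    left in an invalid state."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as e:
            if "CUDA" not in str(e):
                raise  # not a GPU error, let it propagate unchanged
            # best-effort cleanup; likely not enough after an illegal memory access
            torch.cuda.empty_cache()
            raise RuntimeError("GPU error, try a lower resolution") from e
    return wrapper

# hypothetical usage, patched in during the build:
# txt2img.generate_images = recover_from_cuda_errors(txt2img.generate_images)

The build step would then wrap whatever function actually drives the GPU work, the hope being that Gradio gets an ordinary exception to show instead of the app hanging.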

Or just open an MR to the main repo with your solution.

jasalt commented 2 years ago

I eyed those parts a bit, but I will keep my hands off it for now, as it works well enough for me and I have to get back to working on other stuff.

On a side note, the default auto optimizations with the 3060 12GB allow larger resolutions, but the render speed drops quite a bit, from 4-5 it/s to around 1-2 it/s. GPU CUDA activity in Task Manager goes up and down like a sawtooth wave during the render, while it stays near a constant 100% with the optimization flags removed from docker-compose.yml. I'm guessing that it's expected to behave that way with optimizations. I'm pretty pleased running without them and using the upscaling methods to get to around 1280x1280.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 7 days with no activity.