AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: slow generation on aws ec2 g5 instance (a10g) #11530

Closed liao02x closed 1 year ago

liao02x commented 1 year ago

Is there an existing issue for this?

What happened?

I was trying to set up a stable diffusion API service on an AWS EC2 g5 instance, which uses an A10G GPU. It came up correctly and generated images. I had previously run a stable diffusion service on the same instance using diffusers and flask, and I wanted to replace that with sd-webui. However, the generation speed dropped a lot (7 it/s with diffusers vs. 2 it/s with sd-webui).

With diffusers, I used the default setup for Stable Diffusion 2.1. For sd-webui, I downloaded the Stable Diffusion 2.1 checkpoint and imported it. The sampler is DPM++ 2M Karras, so the model and sampler should match the diffusers setup.
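For reference, webui sampler names correspond to diffusers scheduler classes. A minimal sketch of that correspondence (the "DPM++ 2M Karras" entry matches the diffusers documentation; the other entries are illustrative assumptions to verify against your diffusers version):

```python
# Illustrative mapping from sd-webui sampler names to diffusers scheduler
# class names and constructor options. "DPM++ 2M Karras" is
# DPMSolverMultistepScheduler with Karras sigmas enabled; treat the
# remaining entries as assumptions.
WEBUI_TO_DIFFUSERS_SAMPLER = {
    "DPM++ 2M Karras": ("DPMSolverMultistepScheduler", {"use_karras_sigmas": True}),
    "DPM++ 2M": ("DPMSolverMultistepScheduler", {}),
    "Euler a": ("EulerAncestralDiscreteScheduler", {}),
    "Euler": ("EulerDiscreteScheduler", {}),
}

def diffusers_scheduler_for(webui_name: str):
    """Return (scheduler class name, kwargs) for a webui sampler name, or None."""
    return WEBUI_TO_DIFFUSERS_SAMPLER.get(webui_name)
```

If both sides really do use the same scheduler and step count, the per-step cost should be directly comparable, which is what makes the 7 it/s vs. 2 it/s gap suspicious.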

A few things I tried:

  1. Enabled xformers and tcmalloc, which improved the speed a little.
  2. Checked nvidia-smi; the GPU is being used correctly.
  3. The system image was an Ubuntu GPU PyTorch 2.0 image, with CUDA and cuDNN installed.
  4. Switched to an AMI GPU PyTorch 2.0 image and set everything up again. The speed didn't change.
  5. Searched the GitHub issues and added a bunch of flags I found: --opt-sub-quad-attention --opt-channelslast --opt-sdp-no-mem-attention. The speed didn't change.
  6. Tried an fp16 checkpoint in sd-webui (so it's not running exactly the same model). Didn't see much difference in speed.
  7. Thought it could be a Python issue (the system default is 3.7). Built and used 3.10 on the instance and retried all of the above. The speed didn't change.

I was wondering if anyone else running this on an AWS instance has the same issue. The only related issue I found was about problems starting the service, which isn't my case.

Steps to reproduce the problem

  1. Go to AWS and start a g5.xlarge instance with the Ubuntu GPU PyTorch 2.0 image
  2. Install sd-webui on it and start the service
  3. Test generation: ~2 it/s, compared to ~7 it/s running diffusers on the same instance

What should have happened?

I would expect it to have a similar generation speed.

Version or Commit where the problem happens

v1.4.0

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

xformers

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

export COMMANDLINE_ARGS="--share --no-half --xformers --api --opt-sub-quad-attention --opt-channelslast --opt-sdp-no-mem-attention"

List of extensions

None

Console logs

[ec2-user@ip-172-31-30-213 stable-diffusion-webui]$ ./webui.sh 

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on ec2-user user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Accelerating launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `1`
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Python 3.10.0 (default, Jun 28 2023, 01:10:35) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
Version: v1.4.0
Commit hash: 394ffa7b0a7fff3ec484bcd084e673a8b301ccc8
Installing requirements
Launching Web UI with arguments: --share --no-half --xformers --api --opt-sub-quad-attention --opt-channelslast --opt-sdp-no-mem-attention
Loading weights [dcd690123c] from /home/ec2-user/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.safetensors
preload_extensions_git_metadata for 7 extensions took 0.00s
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://af7aa4e8b13ed36bc1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Startup time: 13.2s (import torch: 2.0s, import gradio: 1.9s, import ldm: 2.0s, other imports: 1.7s, load scripts: 0.6s, create ui: 0.7s, gradio launch: 4.0s, add APIs: 0.1s).
Creating model from config: /home/ec2-user/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/configs/stable-diffusion/v2-inference-v.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Applying attention optimization: xformers... done.
Textual inversion embeddings loaded(0): 
Model loaded in 47.1s (load weights from disk: 1.4s, find config: 32.0s, create model: 0.2s, apply weights to model: 12.5s, apply channels_last: 0.3s, move model to device: 0.5s, calculate empty prompt: 0.2s).
100%|████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:16<00:00,  2.95it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:16<00:00,  2.98it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:16<00:00,  2.95it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.87it/s]
Total progress: 100%|████████████████████████████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.95it/s]

Additional information

No response

pio-mahusai commented 1 year ago

Same problem. I have a stable diffusion webui v1.4.0 instance on an EC2 g5.2xlarge which runs very slowly: it waits around 15 seconds before image generation even starts. I don't see this problem on my g4dn.xlarge instance running the same webui version.

liao02x commented 1 year ago

> Same problem. I have a stable diffusion webui v1.4.0 instance on an EC2 g5.2xlarge which runs very slowly: it waits around 15 seconds before image generation even starts. I don't see this problem on my g4dn.xlarge instance running the same webui version.

That's very interesting. I've been using g5 instances since I started and have never tried other instance types. Let me try a g4dn and see how it works.

liao02x commented 1 year ago

> Same problem. I have a stable diffusion webui v1.4.0 instance on an EC2 g5.2xlarge which runs very slowly: it waits around 15 seconds before image generation even starts. I don't see this problem on my g4dn.xlarge instance running the same webui version.

> That's very interesting. I've been using g5 instances since I started and have never tried other instance types. Let me try a g4dn and see how it works.

Tried a g4dn instance with the same setup, and the speed didn't improve. It uses a weaker GPU, so generation is slower than on g5 overall, and the webui is still slower than diffusers (1.3 it/s vs. 3 it/s).

logs from webui:

ubuntu@ip-172-31-89-24:~/stable-diffusion-webui$ ./webui.sh 

################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on ubuntu user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Using TCMalloc: libtcmalloc_minimal.so.4
Python 3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]
Version: v1.4.0
Commit hash: 394ffa7b0a7fff3ec484bcd084e673a8b301ccc8
Installing requirements
Launching Web UI with arguments: --share --no-half --xformers --api --opt-sub-quad-attention --opt-channelslast --opt-sdp-no-mem-attention
Loading weights [dcd690123c] from /home/ubuntu/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.safetensors
preload_extensions_git_metadata for 7 extensions took 0.00s
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://c88ac7807323cc8c9e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Startup time: 13.5s (import torch: 2.6s, import gradio: 1.8s, import ldm: 2.8s, other imports: 1.6s, load scripts: 0.5s, create ui: 0.6s, gradio launch: 3.5s, add APIs: 0.1s).
Creating model from config: /home/ubuntu/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/configs/stable-diffusion/v2-inference-v.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Applying attention optimization: xformers... done.
Textual inversion embeddings loaded(0): 
Model loaded in 10.3s (load weights from disk: 0.9s, find config: 4.8s, create model: 0.2s, apply weights to model: 2.5s, apply channels_last: 0.4s, move model to device: 1.2s, calculate empty prompt: 0.3s).
100%|██████████████████████████████████████████████████████████████████████| 50/50 [01:08<00:00,  1.37s/it]
Total progress: 100%|██████████████████████████████████████████████████████| 50/50 [01:16<00:00,  1.54s/it]
100%|██████████████████████████████████████████████████████████████████████| 50/50 [01:05<00:00,  1.30s/it]
Total progress: 100%|██████████████████████████████████████████████████████| 50/50 [01:06<00:00,  1.32s/it]
100%|██████████████████████████████████████████████████████████████████████| 50/50 [01:07<00:00,  1.34s/it]
Total progress: 100%|██████████████████████████████████████████████████████| 50/50 [01:08<00:00,  1.36s/it]
Total progress: 100%|██████████████████████████████████████████████████████| 50/50 [01:08<00:00,  1.36s/it]

logs from API using diffusers:

(venv) ubuntu@ip-172-31-89-24:~/stable-diffusion-webui$ python3.10 server.py 
INFO:     Started server process [2578]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
100%|██████████████████████████████████████████████████████████████████████| 50/50 [00:16<00:00,  2.95it/s]
INFO:     127.0.0.1:50284 - "POST /design/ HTTP/1.1" 200 OK
100%|██████████████████████████████████████████████████████████████████████| 50/50 [00:16<00:00,  3.02it/s]
INFO:     127.0.0.1:60614 - "POST /design/ HTTP/1.1" 200 OK
ClashSAN commented 1 year ago

Run again with just --share --api --xformers. --no-half negates the possible speedup from either xformers or --opt-sdp-no-mem-attention, and those two are mutually exclusive anyway (only one or the other is active, as shown in the console).

liao02x commented 1 year ago

> 5. --opt-sub-quad-attention --opt-channelslast --opt-sdp-no-mem-attention

This is working. After removing --no-half, the speed is close to the diffusers generations. Thanks!
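For anyone landing here later, the fix was dropping --no-half so half precision stays enabled. A sketch of the trimmed launch line in webui-user.sh (keep --share and --api only if you actually need them):

```shell
# webui-user.sh -- half precision stays on because --no-half is gone;
# xformers remains the single active cross-attention optimization.
export COMMANDLINE_ARGS="--share --api --xformers"
```

With --no-half set, the model runs in fp32, which roughly doubles memory traffic and compute per step on an A10G, so the attention optimizations can't make up the difference.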

happyeungin commented 1 year ago

@liao02x have you ever run into problems when using diffusers?

I am running on an EC2 g5 instance, but it is very slow even though I have already enabled xformers.

I am already using fp16 for half precision.

liao02x commented 1 year ago

> @liao02x have you ever run into problems when using diffusers?

> I am running on an EC2 g5 instance, but it is very slow even though I have already enabled xformers.

> I am already using fp16 for half precision.

I don't have issues with diffusers. I tested both the AMI image with PyTorch 2 and the Ubuntu image with PyTorch 2; both work, and the speed is ~7 it/s with default settings.