NickLucche / stable-diffusion-nvidia-docker

GPU-ready Dockerfile to run the Stability.AI stable-diffusion model v2 with a simple web interface. Includes multi-GPU support.
MIT License

Model Parallel assertion error #17

Open NickLucche opened 2 years ago

NickLucche commented 2 years ago

I spun up a g4dn.12xlarge instance with 4 T4s and ran the command below, but ended up with an AssertionError :/

[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3]
Free GPU memory (per device): [8665, 8665, 8665, 8665]
Search has found that 17 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, ... (17 identical entries)]
Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0}
Creating and moving model parts to respective devices..
(the two lines above are printed once per assignment, 17 times in total)
Downloading: 100%|████| 1.34k/1.34k [00:00<00:00, 739kB/s]
(... further Hugging Face downloads complete successfully, including the 608M, 246M, 1.72G and 167M weight files ...)
Traceback (most recent call last):
  File "server.py", line 9, in <module>
    from main import inference, MP as model_parallel
  File "/app/main.py", line 55, in <module>
    n_procs, devices, model_parallel_assignment=model_ass, **kwargs
  File "/app/parallel.py", line 149, in from_pretrained
    assert d
AssertionError

It's loading a lot of models, 17 in fact. Might that be the culprit?

Anyway, if I can help with testing or in any other way, I'm happy to :) I'm also wondering why it reports only 8665 MB of free memory per device, when nvidia-smi showed 15360 MiB per GPU free just before that.

Originally posted by @huotarih in https://github.com/NickLucche/stable-diffusion-nvidia-docker/issues/8#issuecomment-1264480363
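On the free-memory question above: nvidia-smi's 15360 MiB figure for a T4 is the card's total capacity, whereas a free-memory query returns what remains after the CUDA context and any other processes have claimed their share. A minimal way to check both values, assuming the query goes through PyTorch; the container may measure it differently:

```python
# Hypothetical check (not the repo's code): compare free vs. total memory on
# each visible device. The "total" value matches what nvidia-smi shows for the
# card; "free" is lower once a CUDA context or another process holds memory.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # both returned in bytes
    print(f"cuda:{i}: free={free / 2**20:.0f} MiB, total={total / 2**20:.0f} MiB")
```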

NickLucche commented 2 years ago

Ok, I've taken a first look at this bug and it does seem memory related (it shouldn't try to allocate 17 models given the available memory). I'll take some more time to dig into it and will probably refactor the model-parallel feature; I think it has drifted from its intended purpose of supporting small GPUs, and I've probably tried to optimize some behavior that shouldn't be there.
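For what it's worth, here is a purely hypothetical illustration (none of these names come from parallel.py) of how a memory-driven placement search can end in this kind of AssertionError: each component is greedily placed on the device with the most free memory left, and the search asserts when a component no longer fits anywhere.

```python
# Illustrative sketch only, with made-up sizes; it mimics the failure mode,
# not the actual logic behind the "assert d" in the traceback above.
def assign_components(component_sizes_mb, free_memory_mb):
    """component_sizes_mb: size of each model part; free_memory_mb: device -> free MB."""
    assignment = {}
    remaining = dict(free_memory_mb)
    for idx, size in enumerate(component_sizes_mb):
        device = max(remaining, key=remaining.get)  # device with most memory left
        assert remaining[device] >= size, (
            f"component {idx} ({size} MB) does not fit on any device"
        )
        assignment[idx] = device
        remaining[device] -= size
    return assignment

print(assign_components([2500, 1200, 350, 600], {0: 8665, 1: 8665, 2: 8665, 3: 8665}))
```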

Meanwhile, I suggest you use the standard "Data Parallel" mode, which is more stable and much better suited to a setup where all GPUs have the same memory size! Feel free to report any issues you find with it.

To do so, simply run the same command omitting the -e MODEL_PARALLEL=1 option.

docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:latest

This should spawn 4 workers that evenly share the workload, giving close to a 4x speedup within memory limits (e.g. generating 4 images takes about the same time as generating one).
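To illustrate the data-parallel idea, here is a minimal sketch assuming the diffusers library and one process per GPU; the model id, prompt, and file names are placeholders, not the container's actual worker code:

```python
# Rough sketch of data parallelism for image generation: one pipeline per GPU,
# each process handling an equal share of the prompts.
import torch
import torch.multiprocessing as mp


def worker(device_id, prompts):
    # Import inside the worker so each process builds its own pipeline on its own GPU.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
    ).to(f"cuda:{device_id}")
    for i, prompt in enumerate(prompts):
        pipe(prompt).images[0].save(f"gpu{device_id}_{i}.png")


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # spawn avoids CUDA-after-fork issues
    devices = [0, 1, 2, 3]
    prompts = ["a watercolor of a lighthouse at dawn"] * 4
    procs = [
        mp.Process(target=worker, args=(d, prompts[i::len(devices)]))
        for i, d in enumerate(devices)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```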

Thanks once again for submitting bug reports!

mchaker commented 2 years ago

I'm still interested in the smaller or heterogeneous GPU feature! I think it would be great to be able to use my 3070s too (and good for hobbyists) 😊

NickLucche commented 2 years ago

Sorry for the long delay. I fixed most things here more than a week ago, but I'm still dealing with a nasty bug I haven't quite figured out yet (tensor sizes don't match after wrapping the different components). I hope to get back to you with good news soon!

mchaker commented 2 years ago

Sending good energy and support!! :)