Looking into that!
My idea was the following: the model gets split into 4 main components $\text{unet}_e$, $\text{unet}_d$, text_embedding and vae. These 4 components must be distributed over $N$ GPUs, and possibly replicated more than once so that you can run multiple models split over multiple devices (this is like combining Data and model parallel).
I figured I need to do something like this:

$x_1 \text{unet}_e + y_1 \text{unet}_d + z_1 \text{text-embedding} + k_1 \text{vae} \leq G_1$

$\vdots$

$x_N \text{unet}_e + y_N \text{unet}_d + z_N \text{text-embedding} + k_N \text{vae} \leq G_N$

with $\sum_i x_i = \sum_i y_i = \sum_i z_i = \sum_i k_i$, $\;x, y, z, k \geq 0$, $\;x, y, z, k \in \mathbb{Z}^N$,
where $G_i$ is the memory capacity of GPU $i$, while $\text{unet}_e$ (etc.) represents the memory required by that component. This looks like an ILP problem to me. Unfortunately, I don't know how to solve it. The alternative is to use some greedy approach (start placing components on GPUs where there's enough memory and see where it goes) or brute force, generating all possible combinations. This thing may be overkill for the purpose of this project, so I'll think about it some more and come up with a feasible idea.
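(For the record, an off-the-shelf solver like PuLP could probably handle a formulation of this size; here is a purely illustrative sketch, where the per-component sizes in MB and the "maximize the number of replicas" objective are my own assumptions, not anything settled for this repo.)

```python
# Rough ILP sketch with PuLP -- illustrative only; component sizes (MB) and the
# "maximize number of replicas" objective are assumptions, not the repo's code.
import pulp

G = [16000, 16000, 8000]  # per-GPU free memory (MB), example values
comp = {"unet_e": 1400, "unet_d": 1400, "text_emb": 500, "vae": 350}  # made-up sizes (MB)
N = len(G)

prob = pulp.LpProblem("component_placement", pulp.LpMaximize)
# x[c][i] = how many copies of component c are placed on GPU i
x = {c: [pulp.LpVariable(f"{c}_{i}", lowBound=0, cat="Integer") for i in range(N)]
     for c in comp}
r = pulp.LpVariable("replicas", lowBound=0, cat="Integer")  # complete model replicas

prob += r  # objective: maximize the number of complete models
for i in range(N):  # memory capacity of each GPU
    prob += pulp.lpSum(x[c][i] * comp[c] for c in comp) <= G[i]
for c in comp:  # every replica needs exactly one copy of each component
    prob += pulp.lpSum(x[c]) == r

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("replicas:", int(r.value()))
print({c: [int(v.value()) for v in x[c]] for c in comp})
```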
Thank you so much for this insight! It sounds like an OS scheduling problem, hmm... Would it be possible to solve the ILP/optimization problem using GPUs? They seem ideal for the task. (I don't mean to be facetious -- I mean that, given we are already working with GPUs, perhaps we can use them as part of a startup calculation, then unload the calculation once results are found, and load in the stable diffusion models etc.?)
I found a few resources (but I don't understand them fully):
Thanks for your help, I see your point, that would definitely come in handy, but atm I'm not "scared" enough by the scale of the problem to turn to GPU computing; I think getting to $N=128$ GPUs would be the biggest use-case we can have, with $N*4$ variables. I'm mostly concerned with figuring out whether my formulation is correct or if there's something much simpler that can be used to do this, perhaps with reference to some similar work?
If you're using the huggingface diffusers library, would using huggingface accelerate work?
Or is that only for training models, and not executing them?
Yep, during training you have to keep weight updates synchronized, so it makes sense to use a framework.
I'll go with the brute force solution for now, I'll keep you posted.
Okay, I can generate the possible combinations of components-to-GPUs assignment; it works well (in terms of speed) if we cut down the number of assignments at each step from a theoretical max of N^4 to something like a random sample of 2 of them (ikr). This is a greedy approach so we give up on optimality, but I believe it's a fair trade-off. Furthermore, the max number of models that can be split can be limited not only by the amount of combined available VRAM, but also by the number of processes that must handle them (e.g. I took n_cpus*2).
This is probably an overkill of analysis since I doubt it will be used to generate images on a cluster of 128 A100s, but perhaps it can turn out to be useful for some other projects by simply scaling up the random search I've done here.
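In pseudo-Python, the idea is roughly the following; a simplified sketch for illustration only (made-up component sizes, not the actual code in the branch):

```python
# Simplified sketch of the greedy random-sample search described above --
# not the actual code in the branch; component sizes (MB) are made up.
import random

COMPONENTS = {"unet_e": 1400, "unet_d": 1400, "text_emb": 500, "vae": 350}

def greedy_split(free_mem_mb, max_models, samples_per_step=2):
    """Greedily assign components of up to `max_models` models to GPUs.

    free_mem_mb: list of free MB per GPU. Returns a list of dicts, one per
    model, mapping component name -> GPU index.
    """
    assignments = []
    for _ in range(max_models):
        candidate = {}
        mem = list(free_mem_mb)
        ok = True
        for name, size in COMPONENTS.items():
            # Instead of trying all N placements per component (N^4 per model),
            # sample a couple of GPUs and keep the first one that fits.
            choices = random.sample(range(len(mem)), k=min(samples_per_step, len(mem)))
            fit = next((i for i in choices if mem[i] >= size), None)
            if fit is None:
                ok = False
                break
            candidate[name] = fit
            mem[fit] -= size
        if not ok:
            break
        free_mem_mb = mem
        assignments.append(candidate)
    return assignments

print(greedy_split([16000, 16000, 8000], max_models=8))
```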
WOOO!!! I can't express in plaintext how exciting this is, even if not optimal!
This is a great first step that takes skill to pull off. The broader community may be able to help optimize from here.
128 A100? Not yet, but perhaps if someone makes a job distributor or some kind of kubernetes/distributed scheduler integration for stablediffusion... (looking at myself, maybe)
@NickLucche Would this help at all?
https://cundy.me/post/blog_post_running_gpt_j_on_several_smaller_gpus/
What about setups with NVLink? Does it make it easier to pool memory, or is it the same thing?
NVLink looks like a cool idea, but I'm not sure whether it supports finding the best assignments for multiple model parts. I should look into that. Anyway, I only need to add a few minor things here https://github.com/NickLucche/stable-diffusion-nvidia-docker/tree/model-parallel before testing this approach. Should have updates over the weekend.
Wait, so does this fork of yours make any dual-GPU setup behave like NVLink? And is there any benefit in running this fork with NVLink compared to any other SD forks that do not have your special multi-GPU code?
No, not really; this is high-level code (PyTorch level, not NVIDIA firmware) that's specific to this stable diffusion model. It tries to find an optimal way to distribute the (predefined, fixed) model components across multiple GPUs and takes care of moving tensors from one GPU to the next one.
It should support splitting multiple models. I know it may sound confusing, but it's really just Data+Model Parallel.
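The tensor-moving part is essentially a thin wrapper around each component; a minimal sketch of the idea (not the repo's exact implementation) looks like this:

```python
# Minimal sketch of the "move tensors to the next GPU" idea -- a wrapper that
# ships its input to the wrapped module's device before calling it.
# Not the repo's exact implementation.
import torch
import torch.nn as nn

class ToDeviceWrapper(nn.Module):
    def __init__(self, layer: nn.Module, device: str):
        super().__init__()
        self.layer = layer.to(device)
        self.device = device

    def forward(self, x, *args, **kwargs):
        # Whatever device the input comes from, move it where this part lives.
        return self.layer(x.to(self.device), *args, **kwargs)

# e.g. text encoder on cuda:0, unet on cuda:1, vae on cuda:0 -- the pipeline
# code stays the same, tensors just hop between GPUs at component boundaries.
if torch.cuda.device_count() >= 2:
    block = ToDeviceWrapper(nn.Linear(4, 4), "cuda:1")
    y = block(torch.randn(2, 4, device="cuda:0"))  # input on cuda:0, runs on cuda:1
```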
I'm just an artist, it's definitely confusing to me lol. I found this guy talking about multi-GPU: https://youtu.be/hBKcL8fNZ18?list=PLzSRtos7-PQRCskmdrgtMYIt_bKEbMPfD&t=436
No clue if it's helpful at all
No worries, thanks for your help! I'll try to make it so that you don't have to worry about how it runs under the hood; hopefully it'll simply work!
I'm willing to help test things on my hardware pool if you want some help :)
I'm willing to help test things on my hardware pool if you want some help :)

I was counting on it, really appreciate your help!
I have a somewhat stable build that can be tested with:
docker run --name stable-diffusion --gpus all -it -e DEVICES=all -e MODEL_PARALLEL=1 -e TOKEN=<YOUR_TOKEN> -p 7860:7860 nicklucche/stable-diffusion:multi-gpu
I am expecting some bugs here and there, so please report the logs/error that appear in the console!
Current build has some limitations when MODEL_PARALLEL=1 is set (everything else should work as usual when MODEL_PARALLEL is not set), in particular around the devices selection; I am thinking about a simpler mode in which we only spread a single model over multiple GPUs, that can be turned on by the user.

Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..
I'm excited already! Waiting for the downloads to finish...
@NickLucche does it matter which noise scheduler is used?
This is so exciting!
I generated FOUR 512x512 images in the time it used to take me to generate ONE 512x512 image (on a P100)
Now to try 14 images...
51it [00:07, 6.79it/s]
51it [00:18, 2.80it/s]
51it [00:18, 2.79it/s]
51it [00:18, 2.79it/s]
I think I found a bug!
Hardware environment:
Loading model..
Creating and moving model to cuda:3 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:2 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:5 (Tesla P40)..
Creating and moving model to cuda:0 (NVIDIA GeForce RTX 3070 Ti)..
Creating and moving model to cuda:1 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:4 (Tesla P100-PCIE-16GB)..
Creating and moving model to cuda:6 (Tesla P40)..
When trying to generate 14 images with the following parameters:
prompt: "Multiple nvidia Tesla GPUs"
number of images: 14
steps: 50
height: 512
width: 512
guidance scale: 7.5
seed: 0/default
NSFW filter unchecked
noise scheduler: PNDM
the first GPU fails because it only has 8GB VRAM, which is fine, whatever.
However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing -- the other GPUs finish their work, but the first failed GPU process just sits there in an error state (see the 0it [00:00, ?it/s] at the beginning) and no images appear in the gradio UI (since the job does not complete)
0it [00:00, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/utils.py", line 80, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 151, in forward
hidden_states=sample, temb=emb, encoder_hidden_states=encoder_hidden_states
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_blocks.py", line 505, in forward
hidden_states = attn(hidden_states, context=encoder_hidden_states)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 168, in forward
x = block(x, context=context)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 196, in forward
x = self.attn1(self.norm1(x)) + x
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/attention.py", line 254, in forward
attn = sim.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 7.80 GiB total capacity; 3.10 GiB already allocated; 1.25 GiB free; 5.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
51it [00:32, 1.58it/s]
51it [00:32, 1.58it/s]
51it [00:32, 1.58it/s]
51it [00:32, 1.56it/s]
51it [00:35, 1.45it/s]
51it [00:35, 1.44it/s]
Is there a way to solve that? Perhaps scaling what is scheduled to fit on a per-card basis? (if VRAM amounts differ by card -- which they do here, e.g. the P100 has 16GB VRAM and the P40 has 24GB VRAM)
Thanks a lot for testing that out so promptly! Nice setup btw!
@NickLucche does it matter which noise scheduler is used?
No, you can choose any of the available ones; it shouldn't affect speed noticeably.
However, the main bug is that when the first GPU fails, it blocks the rest of the render request from completing
Yeah unfortunately that is how it is supposed to work atm; the small GPU can be a bottleneck for the whole system if included among the devices. It's not trivial, but I could:

- move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
- introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images

Anyway, does discarding the small device (by setting -e DEVICES=1,2...) solve your issue in generating 14 images?
Same parameters as the last test, this time with -e DEVICES=1,2,3,4,5,6 when initially starting the container with docker run ...: success!
51it [00:33, 1.54it/s]
51it [00:33, 1.54it/s]
51it [00:33, 1.54it/s]
51it [00:35, 1.43it/s]
51it [00:47, 1.07it/s]
51it [00:50, 1.02it/s]
UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!
Trying it with DDIM:
- Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
- NickLucche parallelism: approximately 5.29 sec/image
My favorite image generated from this test, lol: TEDLA
Yeah unfortunately that is how it is supposed to work atm, the small GPU can be a bottleneck for the whole system if included among the devices, it's not trivial but I could:
- move components away from the small GPU "on-the-fly", but that would make generating images painfully slow
- introduce a bias/preference toward big GPUs when searching for the assignment: this is less trivial to implement but would be by far the best choice; anyway, I can't guarantee that it would work for any amount of images
How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?
Interesting... PNDM generated GPUs
but DDIM with GPU-parallelism generated... things like these:
UNDER A MINUTE WITH PNDM! That's 3.57 sec/image!
Trying it with DDIM:
- Doggettx optimization: approximately 56 (P100) to 65 (P40) sec/image
- NickLucche parallelism: approximately 5.29 sec/image
Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!
How would the second choice handle smaller GPUs in the pool? Does the work need to be split evenly between the GPUs (problematic if the GPUs are not evenly sized)?
I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one (rough sketch of what I mean below). Atm tho, I am thinking about adding these features in the upcoming days:
- fp32 support
- simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
- nsfw filter (if it doesn't take up too much time)

Then I'll be merging the results into the master branch and updating the "stable" version. We can have other issues to handle other bugs/enhancements.
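For the "biggest GPUs first" bias, something along these lines; purely illustrative, with made-up component sizes (MB):

```python
# Illustrative only: place each component on the GPU with the most free memory
# left, so small GPUs only ever get the lightest components (if anything).
COMPONENTS = {"unet_e": 1400, "unet_d": 1400, "text_emb": 500, "vae": 350}

def biased_assignment(free_mem_mb):
    mem = list(free_mem_mb)
    assignment = {}
    # Heaviest components first, each onto the currently-largest GPU.
    for name, size in sorted(COMPONENTS.items(), key=lambda kv: -kv[1]):
        gpu = max(range(len(mem)), key=lambda i: mem[i])
        if mem[gpu] < size:
            return None  # not enough memory even for a single model
        assignment[name] = gpu
        mem[gpu] -= size
    return assignment

print(biased_assignment([16000, 8000]))  # e.g. a P100 plus an 8GB card
```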
but DDIM with GPU-parallelism generated... things like these
Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g. -e DEVICES=1) when switching sampler?
Sorry for the late reply, thanks a lot for testing out the build and reporting the inference time too, that is super useful!
No worries, glad to help!
I was thinking about filling the biggest GPUs first, and placing only the lightest component on the small one. Atm tho, I am thinking about adding these features in the upcoming days:
- fp32 support
- simpler mode for users that have multiple small GPUs and want to run the model, as originally planned
- nsfw filter (if it doesn't take up too much time)
Then I'll be merging the results into the master branch and update the "stable" version. We can have other issues to handle other bugs/enhancements.
Yes please!
Yeah that looks weird, are you getting the same gibberish results with the single-model version (e.g. -e DEVICES=1) when switching sampler?
I'll try that and report back! Thank you again, so much, for your work on this.
@NickLucche I don't have results yet, but have you seen this? Would it be helpful? https://github.com/NVIDIA/nccl
Also what would you say "other issues" would be? I can make the issues now for you if you want
I don't have results yet, but have you seen this? Would it be helpful? https://github.com/NVIDIA/nccl
I think this one serves a different purpose by providing low-level routines for collective operations; we're working at a higher abstraction level here by splitting a whole model (essentially a chain of operations).
Also what would you say "other issues" would be? I can make the issues now for you if you want
Thanks again for your help! Atm I think I want to restructure the "Samples" section of the README to showcase some of the things you can do with stable diffusion, like fixing a seed and gradually increasing the guidance scale to get results that are progressively "closer" to the prompted input. I think some tips like that could be useful.
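For that README tip, a snippet along these lines should illustrate it; just a sketch, and the exact diffusers API details (output field names, generator handling) depend on the installed version:

```python
# Hypothetical sketch of the "fixed seed, increasing guidance scale" tip.
# Exact diffusers API details depend on the installed version.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token="<YOUR_TOKEN>"
).to("cuda")

prompt = "Multiple nvidia Tesla GPUs"
for guidance in [3.0, 5.0, 7.5, 10.0, 12.5]:
    # Re-seeding before every call keeps the initial latent noise identical,
    # so only the guidance scale changes between images.
    generator = torch.Generator(device="cuda").manual_seed(0)
    image = pipe(prompt, guidance_scale=guidance, generator=generator).images[0]
    image.save(f"sample_gs{guidance}.png")
```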
Okay, I've added the fp32 support and polished up the code a bit. I'll need to test out that everything that was working before this change is still okay, then I'll be merging this into the master branch.
Awesome! Is there anything specific I need to do to use fp32 mode?
Thank you so much
The good old -e FP16=0 option should do! Closing this issue now before merge.
@NickLucche what were the next steps after this? What issues did you want me to make? What things do you still want me to test? :)
Could you re-test the latest image with -e MODEL_PARALLEL=1? Just wanted to make sure it's working properly on multiple devices without hanging. Make sure you pull the latest image and don't use the one in your cache by adding --pull always to the docker run command. Thanks a lot!
@NickLucche I'm getting an AssertionError :(
latest: Pulling from nicklucche/stable-diffusion
Digest: sha256:199901bbb2a85da90ff91aecd1ccea899f7f8b8c0b407506740594dee4f280ab
Status: Image is up to date for nicklucche/stable-diffusion:latest
Loading model..
Looking for a valid assignment in which to split model parts to device(s): [2, 3, 4, 5]
Free GPU memory (per device): [3504, 6365, 6359, 6532]
Search has found that 5 model(s) can be split over 4 device(s)!
Assignments: [{0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}, {0: 0, 1: 1, 2: 1, 3: 0}]
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Model parallel worker component assignment: {0: 0, 1: 1, 2: 1, 3: 0}
Creating and moving model parts to respective devices..
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.34k/1.34k [00:00<00:00, 492kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 14.9k/14.9k [00:00<00:00, 206kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 342/342 [00:00<00:00, 351kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 543/543 [00:00<00:00, 206kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 4.56k/4.56k [00:00<00:00, 3.82MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.22G/1.22G [07:13<00:00, 2.81MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 209/209 [00:00<00:00, 426kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 592/592 [00:00<00:00, 328kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 492M/492M [00:06<00:00, 73.8MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 525k/525k [00:00<00:00, 1.32MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 472/472 [00:00<00:00, 475kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 806/806 [00:00<00:00, 792kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.06M/1.06M [00:00<00:00, 2.18MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 743/743 [00:00<00:00, 687kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 3.44G/3.44G [02:11<00:00, 26.2MB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 71.2k/71.2k [00:00<00:00, 443kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 522/522 [00:00<00:00, 209kB/s]
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 335M/335M [00:04<00:00, 71.9MB/s]
Traceback (most recent call last):
File "server.py", line 9, in <module>
from main import inference, MP as model_parallel
File "/app/main.py", line 55, in <module>
n_procs, devices, model_parallel_assignment=model_ass, **kwargs
File "/app/parallel.py", line 149, in from_pretrained
assert d
AssertionError
Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow.
I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?
(note below: the /usr/bin/python3 processes on GPUs 0, 1, 6, 7 are expected from another application... and the 4.5-5GB python3 processes on GPUs 2, 3, 4, 5 are expected from another application.)
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11169 C /usr/bin/python3 6543MiB |
| 0 N/A N/A 11170 C /usr/bin/python3 5245MiB |
| 0 N/A N/A 458266 C /opt/conda/bin/python3 2981MiB |
| 0 N/A N/A 458278 C /opt/conda/bin/python3 2981MiB |
| 1 N/A N/A 11175 C /usr/bin/python3 897MiB |
| 1 N/A N/A 11178 C /usr/bin/python3 897MiB |
| 1 N/A N/A 11179 C /usr/bin/python3 897MiB |
| 1 N/A N/A 458266 C /opt/conda/bin/python3 2311MiB |
| 1 N/A N/A 458278 C /opt/conda/bin/python3 2311MiB |
| 2 N/A N/A 403173 C python3 5105MiB |
| 2 N/A N/A 458115 C python3 565MiB |
| 3 N/A N/A 164703 C python3 5115MiB |
| 3 N/A N/A 458115 C python3 565MiB |
| 4 N/A N/A 164949 C python3 4827MiB |
| 4 N/A N/A 458115 C python3 565MiB |
| 5 N/A N/A 165170 C python3 4827MiB |
| 5 N/A N/A 458115 C python3 565MiB |
| 6 N/A N/A 11171 C /usr/bin/python3 1607MiB |
| 7 N/A N/A 11172 C /usr/bin/python3 1615MiB |
+-----------------------------------------------------------------------------+
More errors: tried on a pair of smaller cards:
To create a public link, set `share=True` in `launch()`.
0it [00:01, ?it/s]
Process Process-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
sample = self.conv_in(sample)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-3:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
sample = self.conv_in(sample)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
sample = self.conv_in(sample)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
0it [00:01, ?it/s]
Process Process-4:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 137, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/models/unet_2d_condition.py", line 143, in forward
sample = self.conv_in(sample)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
Thanks a lot for testing! I'll re-open the issue until we fix this feature; I'll get a multi-GPU AWS instance so I can test that too.
Also, is there a way to mount a cache folder for pip to download its packages to? Downloading 4GB+ every time I docker run is... slow
Good point, I'll add a section in the README; it relates to #10.
I also noticed that GPUs 0 and 1 are used (some conda python process) even though I specified GPUs 2, 3, 4, 5?
I'll look into that too, perhaps some leftover hanging processes..?
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
This looks like a driver error; we can open another issue for that with info on the specs of the cards.
Thanks for the link! I'll add a volume for /root/.cache/.
Do you want me to open issues for "leftover hanging processes" and "Unable to find valid cuDNN algorithm"?
(perhaps a missing python dependency?)
Also, tried running again - another error: CUBLAS this time
I think missing dependencies?
Running on local URL: http://localhost:7860/
To create a public link, set `share=True` in `launch()`.
51it [00:05, 8.74it/s]
Attempting to cast a BatchFeature to type None. This is not supported.
Process Process-4:
Process Process-3:
Traceback (most recent call last):
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/parallel.py", line 90, in cuda_inference_process
images: List[Image.Image] = model(prompts, **kwargs)["sample"]
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 82, in __call__
text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/app/utils.py", line 103, in forward
y = self.layer(x.to(self.device), *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 734, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
output_attentions=output_attentions,
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 655, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
output_attentions=output_attentions,
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 582, in forward
output_attentions=output_attentions,
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
query_states = self.q_proj(hidden_states) * self.scale
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 325, in forward
output_attentions=output_attentions,
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 210, in forward
query_states = self.q_proj(hidden_states) * self.scale
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I think missing dependencies?
Yeah, you're definitely missing some drivers for the card you're trying to use. I suggest you first try to install cuDNN and run some example code on the new GPUs; this "hello world" container from NVIDIA may help with that:
docker run --rm --gpus <GPU_NUMBER_HERE> nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
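If that works, a quick PyTorch-level check along these lines can also help narrow things down; just a hypothetical diagnostic sketch, not part of the repo:

```python
# Hypothetical quick check (run in an environment with the same PyTorch build)
# that cuDNN is usable on a given device, e.g. cuda:0.
import torch

device = "cuda:0"
print(torch.cuda.get_device_name(device))
print("cuDNN available:", torch.backends.cudnn.is_available(),
      "version:", torch.backends.cudnn.version())

# A tiny fp16 convolution exercises the same code path that raised
# "Unable to find a valid cuDNN algorithm to run convolution".
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(device).half()
x = torch.randn(1, 3, 64, 64, device=device, dtype=torch.float16)
print(conv(x).shape)
```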
Looks fine to me:
The cards I'm trying to test are a 3070 and 3070 Ti
$ docker run --rm --gpus '"device=6,7"' nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu20.04' locally
11.0.3-base-ubuntu20.04: Pulling from nvidia/cuda
d7bfe07ed847: Already exists
75eccf561042: Pull complete
191419884744: Pull complete
a17a942db7e1: Pull complete
16156c70987f: Pull complete
Digest: sha256:57455121f3393b7ed9e5a0bc2b046f57ee7187ea9ec562a7d17bf8c97174040d
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu20.04
Tue Sep 20 13:19:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:8B:00.0 Off | N/A |
| 0% 40C P0 46W / 240W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:C1:00.0 Off | N/A |
| 45% 33C P0 70W / 310W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Hey, just found this thread! Great looking stuff! I put up a g4dn.12xlarge instance with 4 T4's, tried a command but ended up with AssertionError :/
[ec2-user@ip ~]$ docker run --name stable-diffusion --gpus all -it -e DEVICES=0,1,2,3 -e MODEL_PARALLEL=1 -e TOKEN=token -p 7860:7860 nicklucche/stable-diffusion:multi-gpu Loading model.. Looking for a valid assignment in which to split model parts to device(s): [0, 1, 2, 3] Free GPU memory (per device): [8665, 8665, 8665, 8665] Search has found that 17 model(s) can be split over 4 device(s)! Assignments: [{0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}, {0: 0, 1: 0, 2: 0, 3: 0}] Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. Model parallel worker component assignment: {0: 0, 1: 0, 2: 0, 3: 0} Creating and moving model parts to respective devices.. 
Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.34k/1.34k [00:00<00:00, 739kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 12.5k/12.5k [00:00<00:00, 12.9MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 342/342 [00:00<00:00, 182kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 543/543 [00:00<00:00, 307kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 4.63k/4.63k [00:00<00:00, 2.48MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 608M/608M [00:07<00:00, 77.8MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 209/209 [00:00<00:00, 117kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 209/209 [00:00<00:00, 122kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 572/572 [00:00<00:00, 317kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 246M/246M [00:03<00:00, 72.5MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 525k/525k [00:00<00:00, 58.8MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 472/472 [00:00<00:00, 563kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 788/788 [00:00<00:00, 1.07MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.06M/1.06M [00:00<00:00, 62.3MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 772/772 [00:00<00:00, 1.07MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 1.72G/1.72G [00:22<00:00, 75.2MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 71.2k/71.2k [00:00<00:00, 37.7MB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 550/550 [00:00<00:00, 300kB/s] Downloading: 100%|โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ| 167M/167M [00:02<00:00, 74.1MB/s] Traceback (most recent call last): File "server.py", line 9, in <module> from main import inference, MP as model_parallel File "/app/main.py", line 55, in <module> n_procs, devices, model_parallel_assignment=model_ass, **kwargs File "/app/parallel.py", line 149, in from_pretrained assert d AssertionError
It's loading a lot of models, 17 in fact. Might that be the culprit?
Anyways, if I can participate in testing or help in any way, I'm here to do so :) Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.
That makes 2 of us! Oh no :(
Hi @huotarih, thanks a lot for reporting this bug! I do have some issues developing on a multi-GPU system, as I also need to get something on the cloud, but I'll look into that asap! Would you mind opening a separate issue for this bug?
I'll also ask you to test back the fixed version if that's ok with you :)
Also wondering why it says only 8665MB of free memory when nvidia-smi told me I had 15360MiB per GPU free just before that.
Good point: currently I'm only taking 60% of the free memory of the GPU to instantiate the model(s). That is because generating one or more images requires a substantial amount of free memory, which is only occupied when you actually send the input to the network. 60% is a conservative threshold, as the memory needed varies with the requested image output; I am still unsure how to properly explain that to the user.
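Roughly speaking, the budget is computed like this (a minimal sketch of the idea, not the repo's actual code):

```python
# Minimal sketch (not the repo's actual code) of the 60% budget described
# above, using torch.cuda.mem_get_info to read free VRAM per device.
import torch

def memory_budget_mb(device_ids, fraction=0.6):
    budgets = {}
    for d in device_ids:
        free_bytes, _total_bytes = torch.cuda.mem_get_info(d)
        # Keep only a fraction of the free memory for model weights, leaving
        # headroom for activations, which scale with the requested image size.
        budgets[d] = int(free_bytes / 2**20 * fraction)
    return budgets

print(memory_budget_mb([0, 1, 2, 3]))
```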
Is there a way to get this working on Automatic1111? A single image generation job on multiple GPUs at once?
Yes, but it would need to be a separate contribution to the Automatic1111 repo.
Recently, NVLink also appeared in our cloud; it doesn't work out of the box. We are waiting for an implementation.
Originally posted by @NickLucche in https://github.com/NickLucche/stable-diffusion-nvidia-docker/issues/5#issuecomment-1236097512
I would like to be able to pool resources (VRAM) from the multiple cards I have installed into one pool. For example, I have 4x NVIDIA P100 cards installed. I want to combine them all (16GB VRAM each) into 64GB VRAM so that complicated or high-resolution images don't overload the process with a 16GB VRAM limit.
This also would be useful for people with multiple 4GB VRAM consumer/hobbyist cards to reach workable amounts of VRAM without buying enterprise GPUs.