davizca opened this issue 10 months ago
Hi @davizca, please try to set view_batch_size to 16. It should work for 3090 and will make inference faster.
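For context, a rough sketch of why view_batch_size matters (the numbers and names are illustrative, not the pipeline's actual code or view count):

```python
# Illustrative sketch (not the actual DemoFusion code): a patch-based
# pipeline denoises many overlapping views per step, and view_batch_size
# controls how many views go through the UNet in one forward pass.
views = list(range(64))          # stand-ins for 64 latent view patches
view_batch_size = 16

batches = [views[i:i + view_batch_size]
           for i in range(0, len(views), view_batch_size)]
print(len(batches))  # 4 forward passes per denoising step instead of 64
```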
Hi @RuoyiDu, thanks for the answer.
I set view_batch_size to 16, and for a 2048x2048 image, phase 2 decoding is taking 12+ minutes (and still increasing slowly; I guess it's the same as before). 1024x1024 runs super fast though.
Settings and screenshot added:
Cheers.
Hi @davizca, this is very strange now. Are you running on a laptop with an RTX 3090? The power of the GPU also affects the inference time; I'm using an RTX 3090 on a local server with a power limit of 350W. You can check the power with nvidia-smi.
Hi @RuoyiDu. No, I'm using a desktop PC with Windows and an RTX 3090.
Nvidia-smi reports 190 W on average, and the same for board power draw. VRAM peaks at 23.6 GB when inferencing a 2048x2048 image. The part that takes forever is phase 2 decoding (the earlier phases are fast). I don't know if this is down to some dependency, but it would be awesome if other users with an RTX 3090 could test it. I never get a constant 350 W of board power draw with this pipeline.
Hi @davizca, on my server, it takes about 80s under full load.
I'll try to optimise the speed of the decoding. But it looks like there are other reasons here for it being especially slow at your end. Let's see if anyone else in the community is experiencing similar issues.
3090 on a desktop PC: at phase 2 decoding at 2K resolution it spills work into shared GPU memory and slows down to an unusable point.
Hi @siraxe @davizca. Can you guys try to generate at 2048x2048 and set multi_decoder=False? For generating 2048x2048 images on a 3090, we don't need the tiled decoder. Then we can see if the problem is with the tiled decoder.
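For anyone following along, a hedged sketch of the settings being discussed; only view_batch_size and multi_decoder are confirmed in this thread, the other arguments are assumptions for illustration:

```python
# Hypothetical settings for the generation call; parameter names
# view_batch_size and multi_decoder come from this thread, the rest
# (height, width, num_inference_steps) are assumed for illustration.
generation_kwargs = dict(
    height=2048,
    width=2048,
    num_inference_steps=50,
    view_batch_size=16,     # batch more views per UNet forward pass
    multi_decoder=False,    # skip the tiled decoder; fine at 2048px on 24 GB
)
print(generation_kwargs["multi_decoder"])  # False
```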
Okay, that helped: about 328 seconds for 50 steps.
Thanks @siraxe! But it's still much slower than on my machine... It seems the decoder is quite slow on your PC, which makes it ridiculously slow when using tiled decoder. I will try to figure out the reason -- but it may be a little hard for me since I can't reproduce this issue on my end.
BTW, I like your generation! Hope you can enjoy it!
Was also seeing super slow times on my 4090. Setting multi_decoder=False dramatically improved speed!
It's amazing what the parameters being piped can do to generation times.
With a low batch size of 4 and multi-decoding set to true, I was seeing hour-long generation times. Down to 6 minutes now that I've fixed those! Hope this information is helpful.
Hi. Thanks everyone for checking into this. I'm currently not at home, but I'll try the fix on Monday. The difference in inference times between @RuoyiDu and the others is weird... we'll see what's happening here ;)
Hi guys @davizca @siraxe @Yggdrasil-Engineering, I found a little mistake at line #607:
pad_size = self.unet.config.sample_size // 4 * 3
should be
pad_size = self.unet.config.sample_size // 8 * 3
This should bring the VRAM cost in line with the paper (about 17 GB) and also make decoding faster when multi_decoder=True.
But this bug doesn't affect the result when multi_decoder=False. So I think there might be other reasons, like GPU power (I'm using a 350W RTX 3090 rather than a 280W one).
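To illustrate the effect of the fix, assuming the SDXL UNet's default sample_size of 128 (an assumption; check unet.config on your own checkpoint):

```python
# Assuming unet.config.sample_size == 128 (the SDXL default): the fix
# halves the padding added around each decoded tile, which is where the
# extra VRAM and decode time were going with multi_decoder=True.
sample_size = 128

pad_before = sample_size // 4 * 3   # original: 96 latent pixels of padding
pad_after = sample_size // 8 * 3    # fixed:    48 latent pixels of padding
print(pad_before, pad_after)  # 96 48
```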
@RuoyiDu With multi_decoder=True (normal settings, 2048x2048):
100%|██████████| 50/50 [00:39<00:00, 1.27it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:22<00:00, 3.21s/it]
100%|██████████| 50/50 [00:13<00:00, 3.58it/s]
100%|██████████| 50/50 [02:56<00:00, 3.42s/it]
### Phase 2 Decoding ###
100%|██████████| 64/64 [00:23<00:00, 2.70it/s]
100%|██████████| 50/50 [03:24<00:00, 4.09s/it]
With multi_decoder=False (same settings): about 3:30, more or less the same. (Will post a screenshot later.)
Hi.
I'm using an RTX 3090 GPU with 24 GB VRAM, and I think something is wrong.
Theoretically it should take about 3 minutes or so, but it doesn't.
Also posted on Reddit.
Cheers!