PRIS-CV / DemoFusion

Let us democratise high-resolution generation! (CVPR 2024)
https://ruoyidu.github.io/demofusion/demofusion.html

RTX 3090 insane low speed #11

Open davizca opened 10 months ago

davizca commented 10 months ago

Hi.

I'm using an RTX 3090 GPU with 24 GB of VRAM, and I think something is wrong.

(screenshots attached)

Theoretically it should take about 3 minutes, but it doesn't.

Also posted on Reddit.

Cheers!

RuoyiDu commented 10 months ago

Hi @davizca, please try to set view_batch_size to 16. It should work for 3090 and will make inference faster.
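For anyone following along, here is a minimal sketch of what that looks like in the calling code (the class and parameter names are assumed from the repo's pipeline_demofusion_sdxl.py; the checkpoint ID and prompt are placeholders):

```python
import torch
# Assumed import: the pipeline class shipped in pipeline_demofusion_sdxl.py.
from pipeline_demofusion_sdxl import DemoFusionSDXLStableDiffusionPipeline

pipe = DemoFusionSDXLStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder SDXL checkpoint
    torch_dtype=torch.float16,
).to("cuda")

images = pipe(
    "a photo of the Matterhorn at sunrise",  # placeholder prompt
    height=2048,
    width=2048,
    num_inference_steps=50,
    # A larger view_batch_size batches more local views per UNet call,
    # trading VRAM for speed; 16 is the value suggested here for a 24 GB 3090.
    view_batch_size=16,
)
```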

davizca commented 10 months ago

Hi @RuoyiDu, thanks for the answer.

I set view_batch_size to 16, and for a 2048x2048 image, Phase 2 decoding is taking over 12 minutes (and the estimate keeps slowly climbing, so I guess it's the same as before). 1024x1024 runs super fast, though.

Settings and screenshots attached.

Cheers.

RuoyiDu commented 10 months ago

Hi @davizca, this is very strange now. Are you running a laptop RTX 3090? The GPU's power also affects inference time -- I'm using an RTX 3090 in a local server with a 350 W power limit. You can check the power with nvidia-smi.
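If watching nvidia-smi is inconvenient, the same numbers can be read from Python via the NVML bindings (pip install nvidia-ml-py; a generic sketch, not part of this repo):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# NVML reports power in milliwatts; convert to watts.
draw = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # current board draw
limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # enforced power limit
print(f"drawing {draw:.0f} W of a {limit:.0f} W limit")

pynvml.nvmlShutdown()
```

A card that sits far below its power limit during Phase 2 decoding is likely stalled on something other than compute.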

davizca commented 10 months ago

Hi @RuoyiDu. No, I'm using a desktop PC with Windows and an RTX 3090.

nvidia-smi reports about 190 W on average, and the board power draw reads the same. VRAM peaks at 23.6 GB while inferencing a 2048x2048 image. The part that takes forever is Phase 2 decoding (the earlier phases are fast). I don't know if it's down to some dependency, but it would be great if other RTX 3090 users could test this. I've never seen this pipeline hold a constant 350 W of board power draw.


RuoyiDu commented 10 months ago

Hi @davizca, on my server, it takes about 80s under full load.

(screenshot attached, 2023-12-07 01:56)

I'll try to optimise the speed of the decoding. But it looks like there are other reasons here for it being especially slow at your end. Let's see if anyone else in the community is experiencing similar issues.

siraxe commented 10 months ago

3090 on a PC here. At Phase 2 decoding at 2K resolution, it throws work into shared GPU memory and slows down to an unusable point. (screenshots: terminal and Task Manager attached)
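(That spill happens when dedicated VRAM fills up and the Windows WDDM driver starts backing further allocations with system RAM, which is drastically slower. A quick way to watch PyTorch's side of it; a generic sketch, not repo code:)

```python
import torch

# PyTorch's own accounting of GPU memory. If reserved memory approaches
# the 3090's 24 GB, Windows may start paging into "shared GPU memory",
# and throughput collapses.
alloc = torch.cuda.memory_allocated() / 2**30
reserved = torch.cuda.memory_reserved() / 2**30
print(f"allocated: {alloc:.1f} GiB, reserved: {reserved:.1f} GiB")
```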

RuoyiDu commented 10 months ago

Hi @siraxe @davizca. Can you guys try to generate at 2048x2048 and set multi_decoder=False? For generating 2048x2048 images on 3090, we don't need the tiled decoder. Then we can see if the problem is with the tiled decoder.
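For clarity, that is the same call as the sketch above but with the tiled decoder disabled, so the full 2048x2048 latent goes through the VAE in a single pass (parameter name per the repo; sketch only):

```python
images = pipe(
    prompt,
    height=2048,
    width=2048,
    view_batch_size=16,
    multi_decoder=False,  # decode the whole latent at once instead of in padded tiles
)
```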

siraxe commented 10 months ago

> Hi @siraxe @davizca. Can you guys try to generate at 2048x2048 and set multi_decoder=False? For generating 2048x2048 images on 3090, we don't need the tiled decoder. Then we can see if the problem is with the tiled decoder.

Okay, that helped: about 328 seconds for 50 steps πŸ‘ (screenshots attached)

RuoyiDu commented 10 months ago

Thanks @siraxe! But it's still much slower than on my machine... It seems the decoder is quite slow on your PC, which makes it ridiculously slow when using the tiled decoder. I'll try to figure out the reason -- but it may be a little hard for me since I can't reproduce this issue on my end.

BTW, I like your generation! Hope you can enjoy it!

Yggdrasil-Engineering commented 10 months ago

I was also seeing super slow times on my 4090. Setting multi_decoder=False improved speed dramatically! It's amazing what the parameters being passed can do to generation times.

With a low batch size of 4 and multi_decoder set to True, I was seeing hour-long generation times. I'm down to 6 minutes now that I've fixed those! Hope this information is helpful.

davizca commented 10 months ago

Hi. Thanks, everyone, for looking into this. I'm not at home right now, but I'll try the fix on Monday. The difference in inference times between @RuoyiDu and the others is weird... we'll see what's happening here ;)

RuoyiDu commented 10 months ago

Hi guys @davizca @siraxe @Yggdrasil-Engineering, I found a small mistake at line #607: pad_size = self.unet.config.sample_size // 4 * 3 should be pad_size = self.unet.config.sample_size // 8 * 3. This should bring the VRAM cost in line with the paper (about 17 GB) and also make decoding faster when multi_decoder=True. The bug doesn't affect the result with multi_decoder=False, though, so there might be other reasons, like GPU power (I'm using a 350 W RTX 3090 rather than a 280 W one).
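In code, the one-line fix described above (at what was line #607 of the pipeline file):

```python
# Before: each decoder tile is padded by 3/4 of the UNet sample size,
# inflating VRAM use and slowing the tiled decoder.
pad_size = self.unet.config.sample_size // 4 * 3

# After: pad by 3/8 of the sample size, matching the paper's ~17 GB figure.
pad_size = self.unet.config.sample_size // 8 * 3
```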

davizca commented 9 months ago

@RuoyiDu With multi_decoder = True (normal settings, 2048x2048):

Phase 1 Denoising
100%|██████████| 50/50 [00:39<00:00, 1.27it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:22<00:00, 3.21s/it]

Phase 1 Denoising
100%|██████████| 50/50 [00:13<00:00, 3.58it/s]

Phase 2 Denoising
100%|██████████| 50/50 [02:56<00:00, 3.42s/it]

Phase 2 Decoding
100%|██████████| 64/64 [00:23<00:00, 2.70it/s]
100%|██████████| 50/50 [03:24<00:00, 4.09s/it]

With multi_decoder = False (same settings): about 3:30, more or less the same. (Will post a screenshot later.)