city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0
708 stars · 40 forks

Low speed on AMD GPU #48

Open ArkhamInsanity opened 3 weeks ago

ArkhamInsanity commented 3 weeks ago

Here is my setup: Ubuntu, AMD 6800 XT (16GB VRAM), 32GB RAM, Python 3.10.12, PyTorch 2.2.1+rocm5.7.

I am getting between 14 and 15s/it with flux1-dev-Q2_K.gguf (same with Q4_0 and Q6_K), with T5XXL_fp8 and fp16...

[screenshot]

I can only find examples of Nvidia speeds, so I don't know whether I should expect faster results or not.

city96 commented 3 weeks ago

I'm not sure what speed to expect either, as I don't have an AMD card. From your screenshot it looks like it's loading in lowvram mode, though. What does your workflow look like?

ArkhamInsanity commented 3 weeks ago

[screenshot]

[screenshot]

I also tried different samplers, including the basic ComfyUI workflow with a custom sampler; same results.

These are the results I get with schnell.fp8. I was able to use it without any issue three days ago, but today it keeps crashing during loading or after generating...

[screenshot]

[screenshot]

mamei16 commented 3 weeks ago

This is around the same speed I get on my 6800M. If you're not doing it already, try using the CLI argument --force-fp32

fgdfgfthgr-fox commented 3 weeks ago

> This is around the same speed I get on my 6800M. If you're not doing it already, try using the CLI argument --force-fp32

That works for me, I experienced a significant speedup after that... but why?

ArkhamInsanity commented 3 weeks ago

Thank you very much mamei16! It's much faster this way; I went from 15s/it to 8.73s/it

[screenshot]

Same question, why? :D

xCentral commented 3 weeks ago

--force-fp32 should work if I just throw it behind "python3 main.py" in the bash window?

mamei16 commented 3 weeks ago

> That works for me, I experienced a significant speedup after that... but why?

> Thank you very much mamei16! It's much faster this way; I went from 15s/it to 8.73s/it
>
> Same question, why? :D

I have no idea :D

mamei16 commented 3 weeks ago

> --force-fp32 should work if I just throw it behind "python3 main.py" in the bash window?

Yep!
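
Assuming you normally start ComfyUI with python3 main.py from its folder, the full launch line would simply be:

    python3 main.py --force-fp32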

AroPix commented 3 weeks ago

Do you guys use any other launch arguments? I have a 7800 XT and it's still pretty slow for me, even after forcing fp32. Edit: I get around 8s/it.

mamei16 commented 3 weeks ago

> Do you guys use any other launch arguments? I have a 7800 XT and it's still pretty slow for me, even after forcing fp32. Edit: I get around 8s/it.

I don't know of any other arg that further improves performance. It just seems that the current software does not properly take advantage of AMD GPUs. Even older Nvidia cards perform better.

NextTechAndAI commented 3 weeks ago

> Do you guys use any other launch arguments? I have a 7800 XT and it's still pretty slow for me, even after forcing fp32. Edit: I get around 8s/it.

My 6800 (16GB VRAM, 32GB RAM) gets 18.50s/it with DEV FP16, but 10.83s/it with Q5_K_S when using --force-fp32. Most other models perform similarly to FP16, and FP16 itself even gets much slower with --force-fp32. I've tested --highvram, --gpu-only and some other runtime parameters without significant improvement.

city96 commented 3 weeks ago

What happens if you change this line in dequant.py from:

        out = dequantize(data, qtype, oshape, dtype=None)

To this:

        out = dequantize(data, qtype, oshape, dtype=torch.float32)

xCentral commented 3 weeks ago

From the documentation I can find, ROCm 6's focus has been optimizing fp8 performance on Navi 3. It might be worth trying this model (https://civitai.com/models/637170?modelVersionId=724032) with a ROCm 6.1 Linux setup; I feel you should get better performance than with GGUF. The model is just slightly under 16G, so you'll need to test with your 7800. It works incredibly well on my 7900 XTX.

mamei16 commented 3 weeks ago

> What happens if you change this line in dequant.py from:
>
>     out = dequantize(data, qtype, oshape, dtype=None)
>
> To this:
>
>     out = dequantize(data, qtype, oshape, dtype=torch.float32)

Seems to have no effect.

NextTechAndAI commented 3 weeks ago

Same here, no effect after changing dequant.py. Nevertheless, with --force-fp32 there is still a small speed increase for Q8_0 and a 2x speed increase for Q5_K_S, with and without changing dequant.py. I'm on ZLUDA with ROCm 5.7, and both Q models mentioned are superior to the FP8 model, not least in terms of output quality.

city96 commented 3 weeks ago

Interesting, thank you for testing. That change would make the dequantization also run in FP32, but it looks like that's not contributing to the slowdown significantly then.

Exploder98 commented 3 weeks ago

> That works for me, I experienced a significant speedup after that... but why?

Because by default Comfy tries to use bfloat16 as the compute datatype for FLUX (ref: https://github.com/comfyanonymous/ComfyUI/blob/7df42b9a2364bae6822fbd9e9fa10cea2e319ba3/comfy/supported_models.py#L645). bfloat16 seems to be very slow on AMD through PyTorch for some reason (no hardware support, so it has to emulate the datatype? idk).
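
If you want to sanity-check this on your own card, a rough micro-benchmark along the lines of the sketch below (plain PyTorch, nothing ComfyUI-specific; the matrix size and iteration count are arbitrary) should show whether bfloat16 matmuls are the outlier:

    # Rough dtype micro-benchmark; matrix size and iteration count are arbitrary.
    import time
    import torch

    def bench(dtype, n=4096, iters=20):
        a = torch.randn(n, n, device="cuda", dtype=dtype)  # "cuda" also targets ROCm builds
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return (time.time() - start) / iters

    for dtype in (torch.float32, torch.float16, torch.bfloat16):
        print(dtype, f"{bench(dtype) * 1000:.2f} ms per matmul")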

NextTechAndAI commented 3 weeks ago

Thanks for explaining that change and the reference to bfloat16. BTW, there is no improvement in performance at all when using --force-fp16. I wanted to see if only a few models profit from --force-fp32:

| Quant  | With --force-fp32 | Without any runtime parameters |
|--------|-------------------|--------------------------------|
| Q2_K   | 9.12s/it          | 17.96s/it                      |
| Q3_K_S | 9.59s/it          | 18.46s/it                      |
| Q4_0   | 9.03s/it          | 17.90s/it                      |
| Q4_1   | 9.03s/it          | 17.89s/it                      |
| Q4_K_S | 9.17s/it          | 18.04s/it                      |
| Q5_0   | 11.04s/it         | 18.58s/it                      |
| Q5_1   | 14.29s/it         | 18.55s/it                      |
| Q5_K_S | 11.89s/it         | 18.33s/it                      |
| Q6_K   | 16.17s/it         | 18.24s/it                      |
| Q8_0   | 20.05s/it         | 18.64s/it                      |

All tests were with SCHNELL at 1024x1024 and t5xxl_fp16; two quick DEV tests gave comparable results. The original FP16 model performs similarly to Q8_0. It seems that with --force-fp32 you trade quality for speed and space, as I would expect from quantized models; for some reason this is not the case without the parameter. It would be interesting to know how this parameter influences other AMD GPUs and NVIDIA GPUs.

Exploder98 commented 3 weeks ago

(Nearly) doubling the inference speed by setting some environment variables

I managed to get the speed (FLUX dev, 1024x1024, Q6_K, t5xxl_fp8_e4m3fn) on my RX 6900 XT from 7.3 s/it to around 3.85 s/it by running optimization through PyTorch TunableOp. This is still slower than an RTX 3080 (about 2 s/it), a card with comparable hardware performance, but the improvement is nonetheless sizable.

How I did this

This was actually pretty simple once I found out about TunableOp. It requires at least PyTorch v2.3. To use it, I just had to run ComfyUI like this:

PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 python main.py --novram

The initial optimization run will take something like half an hour, because TunableOp goes through all variants of the matrix multiplication operations trying to find the fastest one for each individual multiplication, but subsequent runs should be fast because the results are saved to a file (tunableop_results.csv by default; this can be changed with the PYTORCH_TUNABLEOP_FILENAME environment variable). I had to run ComfyUI with --novram because otherwise I would run out of GPU memory at one point during the optimization process. When rerunning ComfyUI without this flag afterwards, optimization for one operation did repeat once for some reason (but without OOM).

Additional notes

Unfortunately, changing image size causes the optimization to be run again (matrix sizes change), but I think the results are saved incrementally so that going through all needed resolutions once should be enough. Also, updating anything related to PyTorch/ROCm will cause the optimization results to be invalid and therefore the optimization needs to be run again. I don't know if this optimization works on Windows, as I'm running Linux.
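
As a side note, recent PyTorch builds (2.3 and later) also expose these switches from Python via torch.cuda.tunable, which could for instance be used to point each resolution at its own results file instead of juggling environment variables; the calls below reflect my reading of the docs and may differ slightly between versions, so treat this as a sketch rather than a drop-in recipe:

    # Sketch: controlling TunableOp from Python; assumes PyTorch >= 2.3 with torch.cuda.tunable.
    import torch.cuda.tunable as tunable

    tunable.enable(True)           # same effect as PYTORCH_TUNABLEOP_ENABLED=1
    tunable.tuning_enable(True)    # allow new tuning, not just reading back saved results
    tunable.set_filename("tunableop_1024x1024.csv")  # hypothetical per-resolution file name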

Additionally, I have edited the line I mentioned in my previous comment from

supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]

to

supported_inference_dtypes = [torch.float16, torch.float32]

so that ComfyUI would run the inference in fp16. I did not check whether just using --force-fp32 and TunableOp would bring the same performance boost, but fp16 should be faster anyway. I also tested the non-GGUF fp8 version, and it is slightly faster than GGUF, at around 3.3 s/it.

Final words

So yeah, it seems that the low performance is caused at least partly by PyTorch choosing non-optimal matrix multiplication operations on AMD GPUs. There's probably still something wonky going on as I think my card should still be a bit faster, but almost doubling the performance with a couple of environment variables is already pretty good.

I wonder how many other projects would benefit from running TunableOp. It would also be interesting to know whether this effect is only limited to consumer cards, or if this is present even on datacenter-class cards.

xCentral commented 3 weeks ago

> (Nearly) doubling the inference speed by setting some environment variables
>
> I managed to get the speed (FLUX dev, 1024x1024, Q6_K, t5xxl_fp8_e4m3fn) on my RX 6900 XT from 7.3 s/it to around 3.85 s/it by running optimization through PyTorch TunableOp. [...]

Not sure if you're already using it, but you can also try adding flash attention: https://github.com/Beinsezii/comfyui-amd-go-fast

Exploder98 commented 3 weeks ago

I have not tried that! However, it looks like it only works on Navi 3 for now, and I have an RX 6900 XT, which is a Navi 2 card :(

I think I tried to compile Flash Attention for ROCm at some point and it failed, so I don't have high hopes of getting it to work this time... EDIT: yeah, both the develop branch and the one recommended in the amd-go-fast README fail to compile.

xCentral commented 3 weeks ago

Sorry to hear that. I actually went from 46 seconds to 36 seconds per generation using it with Flux Dev, 24 steps, 1280x720. Did you happen to get a HIPBLASLT_TENSILE error when running PyTorch TunableOp? It mainly states that the patch for gfx1100 can't be found. I think it may be an issue with the PyTorch 2.5 nightly enforcing something.

Exploder98 commented 3 weeks ago

> Did you happen to get a HIPBLASLT_TENSILE error when running PyTorch TunableOp?

I got some warning related to that, but the optimization still worked. Apparently it's possible to disable hipblaslt stuff by setting PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED=0 (all env vars are listed in the TunableOp README).
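
Putting that together with the earlier launch line, it would look roughly like this (same flags as before, just with the extra variable added):

    PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED=0 python main.py --novram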

xCentral commented 3 weeks ago

Thank you! With the tuning parameter it took off about 3 seconds, so I'm down to 32.9 seconds on average with 24 steps @ 1280x720.

AroPix commented 3 weeks ago

Thanks guys! I was able to get it from 8s/it to 2s/it on my 7800 XT using these flags and flash attention. A lot better now! :)

ArkhamInsanity commented 2 weeks ago

> Did you happen to get a HIPBLASLT_TENSILE error when running PyTorch TunableOp?
>
> I got some warning related to that, but the optimization still worked. Apparently it's possible to disable hipblaslt stuff by setting PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED=0 (all env vars are listed in the TunableOp README).

I tried it, but I get this error:

    reading tuning results from tunableop_results0.csv
    key="PT_VERSION" is not provided for validation.
    results validator check failed
    Memory access fault by GPU node-1 (Agent handle: 0x2f900b0) on address 0x74e206205000. Reason: Page not present or supervisor privilege.
    Abandon (core dumped)

Exploder98 commented 2 weeks ago

> key="PT_VERSION" is not provided for validation.

This sounds almost like the csv file (tunableop_results0.csv) is somehow corrupted. Can you check what its contents look like? Maybe try deleting/renaming the csv file and see if rerunning the optimization works.

mamei16 commented 2 weeks ago

> I tried it, but I get this error:
>
>     reading tuning results from tunableop_results0.csv
>     key="PT_VERSION" is not provided for validation.
>     results validator check failed
>     Memory access fault by GPU node-1 (Agent handle: 0x2f900b0) on address 0x74e206205000. Reason: Page not present or supervisor privilege.
>     Abandon (core dumped)

I'm also unable to run the TunableOp optimization because of this.

"Memory access fault by GPU node-1 (Agent handle: 0x2f900b0) on address 0x74e206205000. Reason: Page not present or supervisor privilege." is a known error experienced by ROCm users for a couple of years now. So far, none of the dozen or so "solutions" I've seen have been able to solve that issue for me.

I think the lines before the access fault error are simply a result of tunableop_results0.csv being an empty file.

harakiru commented 2 weeks ago

> (Nearly) doubling the inference speed by setting some environment variables
>
> I managed to get the speed (FLUX dev, 1024x1024, Q6_K, t5xxl_fp8_e4m3fn) on my RX 6900 XT from 7.3 s/it to around 3.85 s/it by running optimization through PyTorch TunableOp. [...]

I'm trying to do this on my machine with 32GB of RAM and an RX 6800 XT, but I keep getting out-of-memory errors, and the .csv file is always empty. How much RAM did you need while doing this? I'm using the full model in fp8 mode, with the GGUF CLIP loader loading a Q5_K_M T5 model along with a LoRA.

Exploder98 commented 2 weeks ago

I think CPU RAM usage was around 20 GiB at max, so it should fit in 32 GiB (I have 64). Did you try with --force-fp32 or did you use my edit to supported_models.py? I tried with --force-fp32 and got some OOMs with that. Also, increasing the batch size even from 1 to 2 seems to cause OOMs during tuning. Remember to use --novram for the tuning run, otherwise you'll get an OOM.

I did not try with LoRAs though. Maybe doing the tuning once first without LoRA and then adding the LoRA afterwards could work?

harakiru commented 2 weeks ago

> I think CPU RAM usage was around 20 GiB at max, so it should fit in 32 GiB (I have 64). Did you try with --force-fp32 or did you use my edit to supported_models.py? I tried with --force-fp32 and got some OOMs with that. Also, increasing the batch size even from 1 to 2 seems to cause OOMs during tuning. Remember to use --novram for the tuning run, otherwise you'll get an OOM.
>
> I did not try with LoRAs though. Maybe doing the tuning once first without LoRA and then adding the LoRA afterwards could work?

I tried doing it for an SDXL model, and after updating ROCm+torch to the latest version the OOM issue went away. I'm now struggling with the .csv file not being generated correctly. It creates a .csv file when I first run it, however it stays empty after tuning, and when I try to run it again it complains: "reading tuning results from tunableop_results0.csv key="PT_VERSION" is not provided for validation. results validator check failed". I don't really know what else might be the problem, since it doesn't spit out any error messages aside from that.

fgdfgfthgr-fox commented 2 weeks ago

> I'm now struggling with the .csv file not being generated correctly. It creates a .csv file when I first run it, however it stays empty after tuning [...]

Use Ctrl+C in the terminal to shut ComfyUI down, and you should see a message saying it's writing the results to the file before the process exits.
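
If the file still ends up empty, another thing to try (assuming a recent PyTorch that ships the torch.cuda.tunable module; the exact call names may vary by version) is to force a write from inside the running process, e.g. via a small temporary edit to main.py, instead of relying on a clean exit:

    # Sketch: flush TunableOp results manually; assumes PyTorch >= 2.3 with torch.cuda.tunable.
    import torch.cuda.tunable as tunable

    if tunable.is_enabled():
        tunable.write_file()  # write the currently collected results to the active csv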

ArkhamInsanity commented 2 weeks ago

> From the documentation I can find, ROCm 6's focus has been optimizing fp8 performance on Navi 3. It might be worth trying this model (https://civitai.com/models/637170?modelVersionId=724032) with a ROCm 6.1 Linux setup; I feel you should get better performance than with GGUF. The model is just slightly under 16G, so you'll need to test with your 7800. It works incredibly well on my 7900 XTX.

Yep, I am trying it today, and I am getting 7.18s/it with --force-fp32, the fp8 flux1 diffusion model and t5xxl_fp8. I need to try different setups (GGUF flux + fp8 t5xxl, fp8 flux1 + GGUF t5xxl) to see how it goes.

harakiru commented 2 weeks ago

> It creates a .csv file when I first run it, however it stays empty after tuning [...]
>
> Use Ctrl+C in the terminal to shut ComfyUI down, and you should see a message saying it's writing the results to the file before the process exits.

Figured it out. It was because I was launching it through comfyui-manager's comfy-cli. You need to launch ComfyUI through "python main.py" or it won't write the .csv file. I will tinker around with it more when I get the chance.

cyber827 commented 2 weeks ago

> Figured it out. It was because I was launching it through comfyui-manager's comfy-cli. You need to launch ComfyUI through "python main.py" or it won't write the .csv file. [...]

Same issue on Windows: the csv file remains empty, I'm getting the PT_VERSION error, and running from main.py does not work. Any idea how to fix it? What exactly is generated in the first row of the csv file?