comfyanonymous / ComfyUI

The most powerful and modular stable diffusion GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

ROCm for Windows has been released #1705

Open Acrivec opened 9 months ago

Acrivec commented 9 months ago

Are there any plans to support ROCm on Windows? AMD users would be happy. https://rocm.docs.amd.com/en/latest/deploy/windows/index.html

comfyanonymous commented 9 months ago

AMD didn't port enough of ROCm to Windows to get PyTorch working.

NeedsMoar commented 9 months ago

Well, they already had it, but the OpenCL-kernel flavor of MIOpen.dll apparently wasn't as fast as HIP, and they opted to only ship a Windows version with Radeon ProRender (for the denoiser) rather than put out a slower version that could have let the torch port start 6 months ago.

Apparently MIOpen.dll's HIP flavor can now be built for Windows from one of their dev branches without issues, and somebody posted their build on Google Drive if anyone is brave... In theory, building torch for ROCm is just a switch: it needs that DLL, the .lib, and ROCm / HIP on the path, and there's a script that renames all the CUDA stuff to call ROCm intrinsics instead. Once that's done I'd expect it to be faster than HIP on Linux, since there's official support / kernels for non-Instinct GPUs (stupidly, no support for the Instinct accelerators though, which is unfortunate since MI100s are $1100 on eBay and have respectable matrix speeds). Unfortunately AMD's development on all that mess is only halfway open source (offline / internal bug trackers and such, and the original Windows port pull request for MIOpen, where I'd found out about the HIP vs. OpenCL thing, got deleted out of nowhere, etc.).
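For reference, the upstream Linux recipe for that "switch" boils down to the hipify script plus a build flag. Here's a minimal sketch, assuming (and it's only an assumption) that the same steps would carry over to a Windows tree once that MIOpen.dll and the HIP toolchain are in place:

```python
# Very rough sketch, not something I've tested on Windows: this mirrors the
# upstream Linux ROCm build of PyTorch; the Windows part is pure speculation.
import os
import subprocess
import sys

def build_pytorch_rocm(pytorch_src: str) -> None:
    # Step 1: the "rename all the cuda stuff" script. It hipifies the source
    # tree in place, rewriting cuda*/cudnn* calls to their hip*/MIOpen
    # equivalents.
    subprocess.check_call(
        [sys.executable, os.path.join("tools", "amd_build", "build_amd.py")],
        cwd=pytorch_src,
    )
    # Step 2: build with USE_ROCM=1 so the HIP backend is selected instead of
    # CUDA; hipcc, MIOpen, rocBLAS, etc. need to be discoverable on the path.
    subprocess.check_call(
        [sys.executable, "setup.py", "develop"],
        cwd=pytorch_src,
        env={**os.environ, "USE_ROCM": "1"},
    )

# A ROCm build still exposes the familiar torch.cuda API; it just reports the
# HIP version instead of a CUDA one:
#   >>> import torch
#   >>> torch.version.hip          # set on ROCm builds, None on CUDA builds
#   >>> torch.cuda.is_available()  # True if a supported AMD GPU is visible
```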

Acrivec commented 9 months ago

Well then, I hope they'll get this shit straight in Q1 2024, since they announced the FSR 3.0 release and some new features for RDNA3.

Andyholm commented 9 months ago

Found this, might be good news??

cccyberwolke commented 7 months ago

Kobold (for LLMs) and Shark (for SD) are working with great performance using ROCm on Windows on my 7900XTX. Is there anything that could be ported from their projects to improve performance for Comfy on Windows?

NeedsMoar commented 7 months ago

> Kobold (for LLMs) and Shark (for SD) are working with great performance using ROCm on Windows on my 7900XTX. Is there anything that could be ported from their projects to improve performance for Comfy on Windows?

Can't speak for Kobold / LLMs. I'm not a fan of LLMs in general because they're just crapflooding the internet, and they don't really fit into Comfy well. Some people have prompt generators that run them or call out to them, but that never really interested me either; that and building workflows are the only interactive parts of SD / Comfy. To each his own, though.

Is ROCm actually usable on Windows in Shark now? The last time I was over there, the bug reports indicated it was roughly 1/10th the speed of the Vulkan backend, which was about 27 it/s on SD 2.0 base at batch size 1, 512x512, when I was using it.

I'm going by older info, but I'd say no for Shark, because I don't think their backend (LLVM compilation) has actually changed. It currently only handles one LoRA at a time, is broken for SD 2.0 / 2.1 768 models (it just silently grabs the 512 version), and requires tuning files for a specific resolution to work. If you go outside its main resolutions, it usually produces degenerate code that runs far more slowly than DirectML on Windows. 640x640 on anything, or 768x768 on SD 1.5, tanks to 13s per iteration and generates messed-up code that loops at 100% GPU utilization, but mostly on fences, drawing under 200W so the card upclocks to around 3.3GHz, where it starts getting flaky and making the whole system choppy. If you're bored enough to let that finish, you get RGB noise garbage as the image.

It requires up-front compilation of models in every pipeline through this monstrosity of a constructor:

```python
class SharkifyStableDiffusionModel:
    def __init__(
        self,
        model_id: str,
        custom_weights: str,
        custom_vae: str,
        precision: str,
        max_len: int = 64,
        width: int = 512,
        height: int = 512,
        batch_size: int = 1,
        use_base_vae: bool = False,
        use_tuned: bool = False,
        low_cpu_mem_usage: bool = False,
        debug: bool = False,
        sharktank_dir: str = "",
        generate_vmfb: bool = True,
        is_inpaint: bool = False,
        is_upscaler: bool = False,
        use_stencil: str = None,
        use_lora: str = "",
        use_quantize: str = None,
        return_mlir: bool = False,
    ):
```

When that's called, it attempts to figure out which CLIP and base UNet your model is using, downloads roughly 7GB of data from civitai, then copies that data into another directory along with the extracted model you're actually using (eating up more space), and then compiles those models into about 3-5GB of flatbuffers that it can finally run. You can't delete any of that ~14GB of pointless data and still run the model it built; it'll just generate it again. If you change the generation size, it has to compile another one. If you switch LoRAs, it has to compile again. If you add a VAE, it has to partially compile. If your prompt goes over 64 characters when it originally didn't, it has to compile a second, even larger UNet flatbuffer. When you change models, or when upstream LLVM breaks something, you get to wipe them all and do it all over again. I used Comfy + DirectML on that card for months despite the slower speed rather than Shark because of those insane limitations. 500GB of drive space eaten up by what amounted to mostly temp files and things it didn't even need to download in the first place was bad.
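To put that recompile behavior in concrete terms: it acts as if every compiled artifact were keyed on the entire generation configuration, so changing any single field is a cache miss and a fresh multi-gigabyte build. The sketch below is only an illustration of that behavior; none of the names in it are SHARK's actual code:

```python
# Toy illustration of the behavior described above -- not SHARK code, all
# names here are hypothetical.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class CompileKey:
    model_id: str
    width: int
    height: int
    lora: Optional[str]
    vae: Optional[str]
    long_prompt: bool  # prompt exceeded the 64-character limit mentioned above

def compile_to_flatbuffer(key: CompileKey) -> str:
    # Stand-in for the expensive download + LLVM/MLIR compile step.
    return f"unet-{key.width}x{key.height}-{key.lora}-{key.long_prompt}.vmfb"

_compiled: Dict[CompileKey, str] = {}

def get_artifact(key: CompileKey) -> str:
    # Any field that differs from a previous run misses the cache, which in
    # practice means another full compile (and more disk space).
    if key not in _compiled:
        _compiled[key] = compile_to_flatbuffer(key)
    return _compiled[key]
```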

To top it all off, it has terrible image generation on the basic pipeline it uses compared to Comfy (something is off with all images, I don't know why), and it isn't capable of honoring CLIP skip config files, which screws up lots of models.

From experience I can say that, if nothing else, you don't want to have to deal with LLVM issues when they break something upstream unless you're a compiler engineer. You don't want to deal with them if you are a compiler engineer either, but at least then you know what you're doing. They're actually a really friendly community; that's not the issue. The issue is that to make headway on bugs that don't have a minimal repro, you need to know the internals well enough to track down the repro from the error messages, which means knowing LLVM IR (and, for Shark, MLIR) and possibly learning how TableGen works if it's an instruction generation error that implicates TableGen. You'll need to know things like which artifacts to start minimizing if you get a message like:

```
Instruction does not dominate all uses!
  %1 = alloca i32
  %mul = load i32* %1
```

They'll generally be very nice about explaining how to get things like that (that particular verifier error means an SSA value is being used somewhere its definition doesn't dominate), and they have tools for all of it, so you're not left flopping in a drained pond like you would be if you were stuck troubleshooting GCC. I'm just pointing out the kind of enormous pains the various ways Shark does things can cause.

Although it was fun sometimes I quit doing that for a living for a reason. :-)