TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

bitsandbytes-rocm rocm-5.6.0 support #681

Open ccbadd opened 11 months ago

ccbadd commented 11 months ago

Any chance to get the rocm branch updated to support rocm-5.6.0?

arlo-phoenix commented 11 months ago

If you use the official docker images or have hipBLASLt installed, you could try my fork here. Also, this is probably not the best place to ask: there are several user ROCm ports, but none are made by the maintainer of this project.

hameerabbasi commented 10 months ago

@arlo-phoenix Could you make a PR? I'd love to have this working/maintained in this repo. @TimDettmers is there a reason not to accept this?

arlo-phoenix commented 10 months ago

> @arlo-phoenix Could you make a PR? I'd love to have this working/maintained in this repo. @TimDettmers is there a reason not to accept this?

I'd love that as well. ROCm currently makes that quite hard though. First, there's a workaround needed to make it use the correct wavefront size so that 4-bit works; second, HIP doesn't offer the same level of library support as CUDA. hipBLASLt doesn't support all the needed instructions and only targets certain arch versions, so I use another workaround just to be able to include the header. I wrote more details in https://github.com/TimDettmers/bitsandbytes/issues/47. The hipBLASLt part doesn't matter that much, because we can just document what's not supported; the hacky way of making it work can't really be merged though, imo.

I might make a draft though; I think this is the easiest way to maintain ROCm support in the project. The "just make the CUDA code work on HIP with ifdefs" approach seems to be popular as well, e.g. the ROCm port of llama.cpp, where I took the idea from, uses it.
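For readers unfamiliar with the "ifdefs" style mentioned above, here is a minimal sketch of what such a single-source portability layer can look like. The macro names are hypothetical, not taken from bitsandbytes or llama.cpp; built with hipcc the aliases resolve to HIP, with nvcc to CUDA, and the plain host fallback just lets the sketch compile without either toolkit.

```cpp
#include <cstdlib>

#if defined(__HIP_PLATFORM_AMD__)
  #include <hip/hip_runtime.h>
  #define gpuMalloc(ptr, size) hipMalloc((ptr), (size))
  #define gpuFree(ptr)         hipFree(ptr)
#elif defined(__CUDACC__)
  #include <cuda_runtime.h>
  #define gpuMalloc(ptr, size) cudaMalloc((ptr), (size))
  #define gpuFree(ptr)         cudaFree(ptr)
#else
  // Host-only stand-in so this example is self-contained on any compiler.
  #define gpuMalloc(ptr, size) ((*(ptr) = std::malloc(size)) ? 0 : 1)
  #define gpuFree(ptr)         (std::free(ptr), 0)
#endif

// Allocate and free one buffer through the portability layer.
// Returns 0 on success, mirroring hipSuccess/cudaSuccess being 0.
int demo_roundtrip(std::size_t bytes) {
    void* buf = nullptr;
    if (gpuMalloc(&buf, bytes) != 0) return 1;
    gpuFree(buf);
    return 0;
}
```

The appeal of this approach is that the CUDA kernels stay the single source of truth and the HIP path diverges only where the APIs genuinely differ.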

nonetrix commented 10 months ago

This being upstreamed would solve 99% of AI issues on ROCm; I hope to see progress on this one day. This is the most up-to-date fork I've found, so I'll try to run some LLMs with it, maybe even try sd-scripts, and post results I guess. Is the hack used something that could eventually be overcome with enough work? Would it need changes to the GPUs themselves, or to ROCm, or is it just more difficult? I'm sorry if I'm misunderstanding things, I'm way out of my league here lol

nonetrix commented 10 months ago

If it can't be easily overcome, maybe just disable 4-bit support and throw an exception when it is used on ROCm. Then someone can just maintain a fork that uses the 4-bit hack; I think this would be vastly better than the current situation.
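The gating suggested here is straightforward to sketch: check the backend at the entry point of the unsupported feature and fail loudly instead of producing wrong results. Everything below is hypothetical (the function and enum are not the actual bitsandbytes API), just an illustration of the pattern.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

enum class Backend { Cuda, Rocm };

// Hypothetical 4-bit quantizer entry point: refuse to run on ROCm
// rather than silently taking a broken code path.
std::vector<int> quantize4bit(const std::vector<float>& values,
                              Backend backend) {
    if (backend == Backend::Rocm) {
        throw std::runtime_error(
            "4-bit quantization is disabled on ROCm builds");
    }
    // Toy stand-in for real quantization: scale into 16 signed levels.
    float scale = 0.0f;
    for (float v : values) scale = std::max(scale, std::abs(v));
    if (scale == 0.0f) scale = 1.0f;
    std::vector<int> out;
    for (float v : values) out.push_back(static_cast<int>(v / scale * 7));
    return out;
}
```

A clear `runtime_error` at the call site is much easier for downstream users (transformers, PEFT, etc.) to handle than a kernel that computes garbage.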

arlo-phoenix commented 9 months ago

> If it can't be easily overcome, maybe just disable 4-bit support and throw an exception when it is used on ROCm. Then someone can just maintain a fork that uses the 4-bit hack; I think this would be vastly better than the current situation.

I opened a draft PR for now. I just don't know yet how far back GPUs support wavefront size 32, or whether the hack causes other issues. Disabling 4-bit isn't really an option for me, since I made the port specifically to test out Microsoft Guidance among other llama stuff. There also isn't a big reason to disable it if the redefine doesn't cause issues elsewhere; I just don't have that much compute/ROCm experience myself. Since I previously just ran into a simple failing static_assert and now it works, I assume it shouldn't cause major issues, but I'd rather have it work without the hack.
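To make the "failing static_assert" concrete: kernels written for CUDA assume 32 lanes per warp, while AMD CDNA GPUs use 64-wide wavefronts, so a compile-time guard like the one below trips on those builds. This is a hedged sketch with hypothetical names, not the actual bitsandbytes source; the redefine workaround described above effectively pins the constant to 32 on wave32-capable (RDNA) hardware.

```cpp
// Hypothetical lane-count constant; the workaround amounts to forcing
// the 32-lane value on ROCm builds that can run in wave32 mode.
#ifndef BNB_WARP_SIZE
#define BNB_WARP_SIZE 32
#endif

// The kind of guard that fails when the wavefront is 64 lanes wide.
static_assert(BNB_WARP_SIZE == 32,
              "4-bit kernels assume 32 lanes per warp/wavefront");

// Example of code that silently depends on the lane count:
// how many warps are needed to cover n threads.
constexpr int warps_for(int n) {
    return (n + BNB_WARP_SIZE - 1) / BNB_WARP_SIZE;
}
```

The risk the comment alludes to is exactly this kind of arithmetic: if the hardware actually executes 64-wide while the code is compiled for 32, anything derived from the lane count (shuffles, reductions, indexing) can be wrong at runtime even though compilation succeeds.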

> This being upstreamed would solve 99% of AI issues on ROCm; I hope to see progress on this one day. This is the most up-to-date fork I've found, so I'll try to run some LLMs with it, maybe even try sd-scripts, and post results I guess. Is the hack used something that could eventually be overcome with enough work? Would it need changes to the GPUs themselves, or to ROCm, or is it just more difficult? I'm sorry if I'm misunderstanding things, I'm way out of my league here lol

Maybe it will be fixed by 5.7; from the stuff I read I don't even know where the issue actually is. The one issue I found said it was in another repo and couldn't easily be avoided with a workaround, since they use precompiled headers. But it's a ROCm issue; the hardware has definitely been ready since RDNA 1 from what I've read, which goes back far enough for most people running AI stuff. If you have any results on performance / support, just comment on the PR from now on. I expect it to be open quite a while to see if it causes any issues or anything is missing, so any tests will be helpful!

BramVanroy commented 3 months ago

Any plans to improve ROCm support?

younesbelkada commented 3 months ago

Hi @BramVanroy, yes this is still the plan! Please see https://github.com/TimDettmers/bitsandbytes/discussions/990 for more details.

BramVanroy commented 3 months ago

@younesbelkada Good to hear, thanks!