ROCm / bitsandbytes

8-bit CUDA functions for PyTorch

Fix wmma api parity #6

Closed. Lzy17 closed this 7 months ago.

Lzy17 commented 7 months ago

Hipify the WMMA API calls with rocWMMA.
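For context, rocWMMA's fragment API deliberately mirrors nvcuda::wmma, so the hipify work is largely a header and namespace swap. Below is a minimal, illustrative sketch of that mapping: one wavefront computing a 16x16x16 half-precision tile product with rocWMMA. It is not the actual bitsandbytes kernel; the kernel name and tile sizes are placeholders.

```cpp
// Illustrative sketch only (not the bitsandbytes kernel): one wavefront
// computing a 16x16x16 half-precision tile product with rocWMMA. The
// equivalent CUDA calls from nvcuda::wmma are noted in the comments.
#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

constexpr int M = 16, N = 16, K = 16;

__global__ void tile_gemm_f16(const rocwmma::float16_t* a,
                              const rocwmma::float16_t* b,
                              rocwmma::float32_t* c,
                              int lda, int ldb, int ldc)
{
    // Per-wavefront fragments (CUDA: nvcuda::wmma::fragment<...>).
    rocwmma::fragment<rocwmma::matrix_a, M, N, K, rocwmma::float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, M, N, K, rocwmma::float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, M, N, K, rocwmma::float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);              // CUDA: wmma::fill_fragment
    rocwmma::load_matrix_sync(fragA, a, lda);           // CUDA: wmma::load_matrix_sync
    rocwmma::load_matrix_sync(fragB, b, ldb);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);  // CUDA: wmma::mma_sync
    rocwmma::store_matrix_sync(c, fragAcc, ldc,         // CUDA: wmma::store_matrix_sync
                               rocwmma::mem_row_major);
}
```

Launched with a single wavefront this covers one tile; the real kernels of course tile over the full matrices, but the API surface being swapped is the one shown above.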

seungduk-yanolja commented 7 months ago

Hello, I am sorry to disturb you, but I cannot find any place to report an issue related to this library. I am testing the new MI300X machine and having trouble importing bitsandbytes after a successful installation. I am unsure whether I can disclose the error message here, so I am waiting for a response from a developer on AMD's side. Please reach out to me at seungduk.kim@yanolja.com; I am a software engineer and can explain what I am experiencing. Thanks!

pnunna93 commented 7 months ago

Hi @seungduk-yanolja, please try installing it from the rocm_enabled branch; the instructions are on that page. Please be aware that full enablement is still pending. You can report any future issues on the https://github.com/ROCm/rocm repo.

seungduk-yanolja commented 7 months ago

Reported the issue here: https://github.com/ROCm/ROCm/issues/2885

Hi @pnunna93, yes, I installed it from the rocm_enabled branch because I saw PRs had been merged into that branch. ChatGPT said this:

The backtrace provided indicates that the core dump resulted from an abort (SIGABRT) raised within the Python process. Specifically, the crash occurs during the dynamic loading of a shared library related to the torch package, more precisely within libhipblaslt.so, which is part of the ROCm platform for AMD GPUs. This suggests an issue in the HIP/ROCm ecosystem, possibly due to an incompatibility or a bug in the library or its dependencies.

The key points in the backtrace indicating the source of the issue are:

  • The termination happens after an attempt to load ExtOpMasterLibrary from libhipblaslt.so, which is part of the ROCm software stack.
  • The crash is preceded by a std::runtime_error, indicating that an exception was thrown within the C++ standard library, leading to a call to std::terminate(), which then causes the process to abort.

Given the complexity of debugging crashes in dynamically loaded libraries, especially in the context of GPU computing, resolving such issues can require deep technical knowledge of the libraries and the underlying hardware. Collaboration with the community or support from the developers of the libraries involved may be necessary.
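To illustrate the failure mode described above (my own minimal sketch, not hipBLASLt code): if a shared library throws std::runtime_error from a static initializer while it is being loaded, the caller that triggered the load cannot catch the exception, so the C++ runtime calls std::terminate() and the process dies with SIGABRT. The class name and file path below are placeholders.

```cpp
// faulty.cpp -- hypothetical build: g++ -shared -fPIC -o libfaulty.so faulty.cpp
// A stand-in for a library (like libhipblaslt.so) that fails while loading its
// op/master library data during static initialization. When this .so is pulled
// in via dlopen(), e.g. while importing torch, the uncaught exception leads to
// std::terminate() and SIGABRT before the import can report a Python error.
#include <fstream>
#include <stdexcept>

namespace {

struct OpMasterLibraryLoader {
    OpMasterLibraryLoader() {
        // Placeholder path: the real cause could be a missing data file,
        // a file built for the wrong GPU architecture, a stale install, etc.
        std::ifstream f("/opt/rocm/lib/example/op_master_library.dat");
        if (!f)
            throw std::runtime_error("cannot load op master library");
    }
};

// Constructed by the dynamic loader while dlopen() is still running.
OpMasterLibraryLoader g_loader;

}  // namespace
```

That matches the observed pattern: the abort happens inside the shared-library load rather than in Python code, which is why it surfaces as a core dump instead of an ImportError.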

pnunna93 commented 7 months ago

Hi @seungduk-yanolja, please reinstall hipblaslt with these steps:

  git clone --recurse https://github.com/ROCmSoftwarePlatform/hipBLASLt
  cd hipBLASLt
  git checkout 4b3b34405e7e25cff404f69bfd0a832644430477
  ./install.sh -idc

You may need to copy and relink the hipblaslt .so files from the build directory to /opt/rocm/lib/ if they are not replaced automatically after the build.
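As a quick sanity check after the rebuild and relink (my own suggestion, not part of the official steps), you can try loading the installed library directly and print the loader error if it fails. The path below assumes the default /opt/rocm install location.

```cpp
// check_hipblaslt.cpp -- hypothetical build: g++ -o check_hipblaslt check_hipblaslt.cpp -ldl
#include <cstdio>
#include <dlfcn.h>

int main() {
    const char* path = "/opt/rocm/lib/libhipblaslt.so";  // assumed install path
    void* handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        // A missing library or unresolved symbols show up here with a reason.
        std::printf("failed to load %s: %s\n", path, dlerror());
        return 1;
    }
    std::printf("loaded %s successfully\n", path);
    dlclose(handle);
    return 0;
}
```

Note that if the library's own load-time initialization throws (as in the backtrace above), this check will abort with SIGABRT rather than print a message, which is itself a useful signal that the rebuilt library, not the Python side, is at fault.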

seungduk-yanolja commented 7 months ago

These look like the same commands described in the README.md of the rocm_enabled branch. I already used them, but let me retry.

Update: I tried to install hipBLASLt again, but there was an error (invalid memory access) and the whole filesystem became read-only. I rebooted the machine and then it did not correctly recognize the GPUs; after rebooting via the IPMI it returned to normal. At the moment, all I can do with this machine (MI300X) is run vLLM with 4 of the 8 GPUs, because the output becomes very strange when I use all 8. I will keep exploring what I can do.

Titus-von-Koeller commented 7 months ago

Hey all! I'm Titus, one of the bitsandbytes maintainers. We currently have a strong push underway to officially support hardware backends other than CUDA in bitsandbytes. Would you be willing to help us get the AMD part right and consolidate the codebases?

amathews-amd commented 7 months ago

Hi @Titus-von-Koeller, sure! We were planning to reach out to you once we closed out some internal dependencies. Is there a forum where we can discuss this?

pnunna93 commented 7 months ago

Hi @seungduk-yanolja, it sounds like there is an issue with the hipblaslt build or linking. The version I pointed to does have the ExtOpMasterLibrary class, but something else is going wrong in the build. Please follow up on the ROCm issue; they should be able to help. Thanks.

seungduk-yanolja commented 7 months ago

Thank you all. I no longer have access to the machine since it was a short-term PoC. Another PoC is scheduled for next month, so I will try again then. Thanks again.