Closed Lzy17 closed 7 months ago
Hello, I am sorry to disturb you but I cannot find any place I can report an issue related to this library.
I am testing the new MI300X machine and having trouble importing bitsandbytes after successful installation.
I am unsure if I can disclose the error message here, so I am waiting for a response from any developers on AMD's side.
Please reach out to me through seungduk.kim@yanolja.com
I am a software engineer and I can explain what I am experiencing now.
Thanks!
Hi @seungduk-yanolja, please try installing it from the rocm_enabled branch; the instructions are on that page. Please be aware that full enablement is still pending. You can report any future issues on the https://github.com/ROCm/rocm repo.
Reported the issue here: https://github.com/ROCm/ROCm/issues/2885
Hi @pnunna93, yes, I installed it from the rocm_enabled branch because I saw PRs were merged into this branch.
ChatGPT said this:
The backtrace provided indicates that the core dump resulted from a segmentation fault (SIGABRT) triggered within the Python process. Specifically, the crash occurs during the dynamic loading of a shared library related to the torch package, more precisely within libhipblaslt.so, which is part of the ROCm platform for AMD GPUs. This suggests an issue related to the HIP/ROCm ecosystem, possibly due to an incompatibility or a bug in the library or its dependencies.
The key points in the backtrace indicating the source of the issue are:
- The termination happens after an attempt to load ExtOpMasterLibrary from libhipblaslt.so, which is part of the ROCm software stack.
- The crash is preceded by a std::runtime_error, indicating that an exception was thrown within the C++ standard library, leading to a call to std::terminate(), which then causes the process to abort.
Given the complexity of debugging segmentation faults in dynamically loaded libraries, especially within the context of GPU computing, resolving such issues can sometimes require deep technical knowledge of the libraries and the underlying hardware. Collaboration with the community or seeking support from the developers of the libraries involved may be necessary.
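For anyone reproducing this, a minimal sketch of how the failing import could be isolated in a child process, so the abort does not take down the interactive session and the terminating signal can be read off directly. Note that an abort raised through std::terminate() is delivered as SIGABRT, which is a different signal from a segmentation fault (SIGSEGV). The helper names `run_in_child` and `try_import` are hypothetical and not part of bitsandbytes or ROCm; the signal decoding assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_in_child(code, timeout=120.0):
    """Run a Python snippet in a fresh interpreter and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result = {
        "ok": proc.returncode == 0,
        "returncode": proc.returncode,
        "signal": None,
        # Keep only the tail of stderr; native backtraces can be long.
        "stderr_tail": proc.stderr[-2000:],
    }
    if proc.returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        try:
            result["signal"] = signal.Signals(-proc.returncode).name
        except ValueError:
            result["signal"] = f"signal {-proc.returncode}"
    return result


def try_import(module_name):
    """Attempt `import module_name` in a child interpreter."""
    return run_in_child(f"import {module_name}")


if __name__ == "__main__":
    # Check torch first, then bitsandbytes, to see which import aborts.
    for name in ("torch", "bitsandbytes"):
        print(name, try_import(name))
```

If the report for bitsandbytes shows a signal while torch imports cleanly (or the other way around), that narrows down which package triggers the load of libhipblaslt.so.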
Hi @seungduk-yanolja, please reinstall hipblaslt with these steps:
git clone --recurse https://github.com/ROCmSoftwarePlatform/hipBLASLt
cd hipBLASLt
git checkout 4b3b34405e7e25cff404f69bfd0a832644430477
./install.sh -idc
You may need to copy and relink hipblaslt .so files from build dir to /opt/rocm/lib/ if it doesn't automatically get replaced after build.
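Before copying anything by hand, it may help to see which hipblaslt files are currently installed and where their symlinks point. The following is a rough sketch only: `find_library_files` is a hypothetical helper, and the `libhipblaslt` file-name prefix is an assumption based on the library name mentioned earlier in this thread. It only reads the directory and changes nothing.

```python
from pathlib import Path


def find_library_files(lib_dir, stem="libhipblaslt"):
    """List shared-object files in lib_dir whose names start with stem.

    Each entry records whether the file is a symlink and what it resolves to,
    which shows whether a freshly built library actually replaced the old one.
    """
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    entries = []
    for path in sorted(directory.iterdir()):
        if path.name.startswith(stem) and ".so" in path.name:
            target = path.resolve()
            entries.append(
                {
                    "name": path.name,
                    "is_symlink": path.is_symlink(),
                    "target": str(target),
                    "target_exists": target.exists(),
                }
            )
    return entries


if __name__ == "__main__":
    for entry in find_library_files("/opt/rocm/lib"):
        print(entry)
```

A symlink whose target does not exist, or one that still resolves to the old file after the build, would suggest that the copy and relink step is needed.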
It looks like the same command lines as described in the README.md of the rocm_enabled branch. I used these but let me retry.
Update: I tried to install hipBLASLt again but there was an error (invalid memory access) and the whole filesystem became read-only. I rebooted the machine and then it did not correctly recognize the GPUs. I rebooted the IPMI and then it became normal. At this moment, what I can do with this machine (MI300X) is run vLLM with 4 out of 8 GPUs because the output became so weird when I used all 8 GPUs. Will try and explore more what I can do.
Hey all! I'm Titus, one of the bitsandbytes maintainers. We currently have a strong push underway to officially support hardware backends other than CUDA in BNB. Would you be willing to help us get the AMD part right and consolidate the code-bases?
Hi @Titus-von-Koeller, sure! We were planning to reach out to you once we closed some internal dependencies. Is there a forum where we can discuss this?
Hi @seungduk-yanolja, sounds like there is an issue with hipblaslt build/linking. The version I pointed to has ExtOpMasterLibrary class but something else is going wrong in the build. Please check back on the ROCm issue, they would be able to help. Thanks.
Thank you all. I do not have access to the machine anymore since it was a short-term PoC. There is another PoC scheduled next month, so I will try again then. Thanks again.