ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Bug]: Tensile Library Crashes when trying to implement llama_index RAG with AMD GPU using ROCm 6.0.0 #1397

Closed sayanmndl21 closed 5 months ago

sayanmndl21 commented 5 months ago

Describe the bug

Running any llama_index (0.10.16) RAG tutorial with an AMD GPU crashes when the Tensile library is initialized. The crash doesn't occur on the first run, but every subsequent run fails with:

rocBLAS error: Could not load /opt/rocm-6.0.0/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat

rocBLAS error: Could not initialize Tensile library
Aborted

While debugging, I found that it might have something to do with VectorStoreIndex.

To Reproduce

Install the stable ROCm release (per the ROCm install guide):

sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
# See prerequisites: add the current user to the render and video groups
sudo usermod -a -G render,video $LOGNAME
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/focal/amdgpu-install_6.0.60000-1_all.deb
sudo apt install ./amdgpu-install_6.0.60000-1_all.deb
sudo apt update
sudo apt install amdgpu-dkms
sudo apt install rocm
echo "Please reboot system for all settings to take effect."

Install llama-index with pip install llama-index

Reinstall llama-cpp-python with ROCm support (may need sudo -H):

python3 -m pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS=gfx90a" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.52 --no-cache-dir

Run the tutorial: I am following the Mistral RAG tutorial with a local LLM.

Expected behavior

Successfully answer the query.

Log-files

Some lines in the log from debug mode:

DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: We'll get the boys to gether and have the initi...
> Adding chunk: We'll get the boys to gether and have the initi...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: "Have the which?" 
"Have the initiation." 
"Wha...
> Adding chunk: "Have the which?" 
"Have the initiation." 
"Wha...
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: "That's gay—that's mighty gay, Tom, I tell you....
> Adding chunk: "That's gay—that's mighty gay, Tom, I tell you....
DEBUG:llama_index.core.node_parser.node_utils:> Adding chunk: CONCLUSION 
SO endeth this chronicle. It being ...
> Adding chunk: CONCLUSION 
SO endeth this chronicle. It being ...

rocBLAS error: Could not load /opt/rocm-6.0.0/lib/llvm/bin/../../../lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat

rocBLAS error: Could not initialize Tensile library
Aborted

Environment

Hardware description
CPU AMD EPYC 9124 16-Core Processor
GPU AMD Instinct MI210
Software version
rocm-core v6.0.0.60000-91~20.04
rocblas v4.0.0.60000-91~20.04

Attached environment.txt


sayanmndl21 commented 5 months ago

Just wanted to update: I figured out the issue was with the Hugging Face embedding model, which was initially using the PyTorch ROCm 5.6 backend. Updating to PyTorch ROCm 6.0 fixed the issue. I did note that Torch+ROCm 5.6 otherwise works on ROCm 6.0.0 as usual, but the backend for Hugging Face fails, so I couldn't debug the issue properly. This might help others facing similar issues.
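For anyone hitting the same symptom: PyTorch ROCm wheels embed their ROCm release in the local version tag (e.g. 2.2.1+rocm5.6, visible via torch.__version__), so a quick sanity check is to compare that tag against the system ROCm install before blaming rocBLAS. Below is a minimal, hedged sketch; rocm_versions_match is a hypothetical helper written for this issue, not part of PyTorch or rocBLAS, and matching on the major version only is an assumption based on the mismatch reported above.

```python
import re

def rocm_versions_match(torch_version: str, system_rocm: str) -> bool:
    """Hypothetical helper: compare the ROCm tag baked into a PyTorch wheel
    version string (e.g. '2.2.1+rocm5.6') against the system ROCm release
    (e.g. '6.0.0'), matching on the major version only."""
    m = re.search(r"\+rocm(\d+)\.(\d+)", torch_version)
    if m is None:
        return False  # not a ROCm build of PyTorch
    return m.group(1) == system_rocm.split(".")[0]

# The mismatch reported in this issue:
print(rocm_versions_match("2.2.1+rocm5.6", "6.0.0"))  # False -> reinstall torch
# After updating to a ROCm 6.0 wheel:
print(rocm_versions_match("2.2.1+rocm6.0", "6.0.0"))  # True
```

In practice you would feed it torch.__version__ and the version reported by your /opt/rocm-* install; if they disagree, reinstalling torch from the matching ROCm wheel index (as done here) is the likely fix.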

rkamd commented 5 months ago

@sayanmndl21, glad you figured it out. I will close this ticket since the issue is outside the scope of rocBLAS. You can re-open it if you need any support from rocBLAS.