ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Bug]: Stop breaking backwards compatibility or at least warn #1386


danielzgtg commented 8 months ago

Describe the bug

rocBLAS 5.6 fails with a confusing error message when mixed with ROCm 6.0 shared libraries or TensileLibrary data files.

To Reproduce

Steps to reproduce the behavior (condensed into a shell sketch after the list):

  1. Install ROCm 6.0
  2. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
  3. Install https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm
  4. Run https://www.llamaindex.ai/ or https://github.com/AUTOMATIC1111/stable-diffusion-webui
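
The steps above, condensed into a shell sketch (hedged: the exact ROCm install command depends on the distro, and the bitsandbytes fork builds per its own README):

```console
$ # 1. install ROCm 6.0 (assuming Ubuntu 22.04 and AMD's installer)
$ sudo amdgpu-install --usecase=rocm
$ # 2. install the ROCm 5.6 PyTorch wheels
$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
$ # 3. build and install the bitsandbytes ROCm fork (rocm branch)
$ git clone -b rocm https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
$ # 4. run llamaindex or stable-diffusion-webui; rocBLAS aborts on the first GPU matmul
```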

Expected behavior

I should not have to spend an hour debugging this, only to find the problem with gdb. rocBLAS 5.6 should either succeed or give a clear error message when loading TensileLibrary data files from rocBLAS 6.0, or when loaded alongside mismatched ROCm shared libraries.
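
Something along these lines (purely illustrative; no such message exists in rocBLAS today) would have pointed at the cause immediately:

```console
rocBLAS error: TensileLibrary_lazy_gfx1030.dat has data format version 4, but this librocblas expects version 3
rocBLAS error: this usually means ROCm 5.6 and ROCm 6.0 libraries are mixed in the same process
```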

Log-files

```console
$ ./main.py
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/configuration_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/modeling_stablelm_epoch.py HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/generation_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /stabilityai/stablelm-zephyr-3b/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /BAAI/bge-small-en-v1.5/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:llama_index.readers.file.base:> [SimpleDirectoryReader] Total files added: 1
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: What I Worked On February 2021 Before college...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I couldn't have put this into words when I was ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So I looked around to see what I could salvage ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I didn't want to drop out of grad school, but h...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: We actually had one of those little stoves, fed...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: But Interleaf still had a few years to live yet...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Painting students were supposed to express them...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Meanwhile I'd been hearing more and more about ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: In return for that and doing the initial legal ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Which meant being easy to use and inexpensive. ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Nor had I changed my grad student lifestyle sig...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now when I walked past charming little restaura...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: A lot of Lisp hackers dream of building a new L...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Over the next several years I wrote lots of ess...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: So we just made what seemed like the obvious ch...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: I don't think it was entirely luck that the fir...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: YC was different from other kinds of work I've ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: For the rest of 2013 I left running YC more and...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Now they are, though. Now you could continue us...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Notes [1] My experience skipped a step in the ...
DEBUG:llama_index.node_parser.node_utils:> Adding chunk: Startups had once been much more expensive to s...
rocBLAS error: Could not load /opt/rocm-6.0.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
rocBLAS error: Could not initialize Tensile library
Aborted (core dumped)
```

Environment

Hardware description
- CPU: AMD Ryzen 9 5900X 12-Core Processor
- GPU: AMD Radeon RX 6650 XT

Software version
- rocm-core: 6.0.0.60000-91~22.04
- rocblas: 4.0.0.60000-91~22.04

environment.txt

Workaround

Recompile PyTorch manually. This ensures that it loads the ROCm shared libraries from /opt instead of the copies bundled in the venv.
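
To confirm which copy of librocblas a process actually maps, glibc's loader tracing helps (a diagnostic sketch; the paths and the two outcomes shown are illustrative):

```console
$ LD_DEBUG=libs python3 -c "import torch" 2>&1 | grep rocblas
# with the prebuilt wheel, the bundled 5.6 copy wins:
      trying file=venv/lib/python3.11/site-packages/torch/lib/librocblas.so
# after rebuilding against /opt, the system 6.0 copy is used:
      trying file=/opt/rocm-6.0.0/lib/librocblas.so
```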

mahmoodw commented 8 months ago

Hello @danielzgtg,

Thank you for flagging the need for clearer error messages with ROCm and library version mismatches. Your feedback is vital in refining our library's usability.

Our team will investigate and refine the error notifications to offer guidance for resolving library version disparities. Additionally, we'll clarify any backward compatibility restrictions to assist users in navigating version conflicts more effectively.

We'll keep you updated on our progress as we work to enhance the error messages. Your patience and any additional insights during this process are immensely valuable.

Wasiq

rkamd commented 7 months ago

@danielzgtg, thanks for reporting the issue. Do you see the Tensile library files in the path? The output of `find /opt/ -name "TensileLibrary_*.dat"` would help to debug further.
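
For reference, with the packaged ROCm 6.0 install the command and the kind of output being asked about look like this (only the gfx1030 entry is confirmed by the log above; the full list depends on the installed package):

```console
$ find /opt/ -name "TensileLibrary_*.dat"
/opt/rocm-6.0.0/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
```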

ghost commented 7 months ago

That explains it. I spent the last week troubleshooting why ROCm suddenly stopped working, and it turns out to be a backwards compatibility issue. Quite frustrating.

rkamd commented 7 months ago

@danielzgtg and @Trat8547, we were able to run the sample rocBLAS program against both the ROCm 5.6 and ROCm 6.0 releases, and internally we have not received any backward compatibility reports from the frameworks team either.

That said, in general when the major version changes (we follow semantic versioning), API breaks are expected. Upon reviewing the release notes we see breaking changes in HIP, and the appropriate notification is published here.

Those changes could have contributed to the issue reported here.

danielzgtg commented 7 months ago

Here: TensorLibrary.txt. I think the TensileLibrary_*.dat files themselves are fine; the problem is the (lack of) version detection in the code that reads them.

Your linked https://rocm.docs.amd.com/en/latest/about/release-notes.html#hip appears to list only API-breaking changes. My issue is about ABI-breaking changes.
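
One quick way to see an ABI-level difference, as opposed to an API change, is to compare the sonames of the wheel-bundled and system copies (a sketch; the soname values are illustrative, though per the environment above the rocblas package did jump to 4.0.0 in ROCm 6.0):

```console
$ readelf -d venv/lib/python3.11/site-packages/torch/lib/librocblas.so | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [librocblas.so.3]
$ readelf -d /opt/rocm-6.0.0/lib/librocblas.so | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [librocblas.so.4]
```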

The problem is that the PyTorch ROCm wheels bundle .so files that overlap with the system versions in /opt/. Perhaps deleting the libroc* files from venv/lib/python3.11/site-packages/torch/lib/ would force the correct (system) versions to be used. In any case, my issues on the other AMD repo suggested fixing this unnecessary shared-library bundling in pytorch, but perhaps rocBLAS itself should also detect the problem. I think glibc handles this properly: it refuses to let the application run at all if the wrong version is loaded.
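
Both points as a sketch (the rm is the hypothetical fix suggested above; the second command shows the kind of hard failure glibc's symbol versioning produces, with a made-up binary name):

```console
$ # hypothetical: drop the wheel-bundled copies so the loader falls back to /opt
$ rm venv/lib/python3.11/site-packages/torch/lib/libroc*
$ # glibc, by contrast, refuses to run at all when versions don't line up:
$ ./app_built_against_newer_glibc
./app_built_against_newer_glibc: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found (required by ./app_built_against_newer_glibc)
```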

This is why rebuilding pytorch worked around the problem. But I would rather not wait through the long pytorch compile every time, and I also don't want the prepackaged pytorch builds to ship libroc*.so files that not only inflate the download to gigabytes but also cause these version conflicts.