Closed: Awlexus closed this issue 1 month ago
Hey @Awlexus, this could be an issue with the build environment. To be sure, you can alternatively use the Docker scripts (`./build.sh rocm`), then set `XLA_ARCHIVE_URL=file:///path/to/build.tar.gz` accordingly.
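As a minimal sketch of that flow (the script invocation comes from the xla repo's Docker build; the archive path is a placeholder, and the `mix deps.compile` step is an assumption about how you would force the dependency to pick up the local archive):

```shell
# Sketch: build the XLA archive via the Docker script, then point the
# Mix build at the resulting local file instead of a downloaded one.
# (the build steps below are illustrative and commented out)
#
#   git clone https://github.com/elixir-nx/xla.git && cd xla
#   ./build.sh rocm

# Replace with the actual path to the archive produced by the build
export XLA_ARCHIVE_URL="file:///path/to/build.tar.gz"

# Then recompile the dependency so it uses the local archive:
#   mix deps.compile xla --force
```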
In case your GPU uses gfx1100 (7900 XTX), you may need to use a more recent XLA revision as per https://github.com/elixir-nx/xla/issues/63#issuecomment-1844195261 (either by setting `OPENXLA_GIT_REV` when running `mix compile`, or by changing the Makefile directly in case of the Docker build).
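A sketch of the non-Docker variant, assuming the environment variables work as described above (`<commit-sha>` is a placeholder; use the revision from the linked comment):

```shell
# Pin a newer OpenXLA revision and build from source for ROCm.
# <commit-sha> is a placeholder, not a real revision.
export OPENXLA_GIT_REV="<commit-sha>"
export XLA_BUILD=true
export XLA_TARGET=rocm

# Then, in the project directory:
#   mix compile
```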
Thanks @jonatanklosko, I was able to compile it using a more recent XLA git ref, but I could not get it to run on the GPU. I tried again using the Docker script to build it (which took a long time) and hit the same error. It was able to allocate the memory, but the program was stopped by the operating system soon after. I'm not sure where exactly this error comes from.
```
2023-12-28 23:43:05.394087: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN
[info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[info] XLA service 0x7fa4c018dc30 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info]   StreamExecutor device (0): AMD Radeon RX 6900 XT, AMDGPU ISA version: gfx1030
[info] Using BFC allocator.
[info] XLA backend allocating 15446782771 bytes on device 0 for BFCAllocator.
fish: Job 1, 'iex -S mix phx.server $argv' terminated by signal SIGSEGV (Address boundary error)
```
Hmm, do you do any Nx stuff on boot? Does the error happen every time? I assume it doesn't happen if you use CPU only? You can also try `ELIXIR_ERL_OPTIONS="+sssdio 128 +sssdcpu 128"`, though that rather helps with segfaults.
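For reference, that suggestion would look something like this (the flags bump the suspend stack sizes of the dirty IO/CPU schedulers; starting with `iex -S mix phx.server` is an assumption based on the earlier crash log):

```shell
# Increase dirty IO/CPU scheduler suspend stack sizes to 128 KiB words,
# which can help with native-code segfaults on deep stacks.
export ELIXIR_ERL_OPTIONS="+sssdio 128 +sssdcpu 128"

# Then start the app as usual:
#   iex -S mix phx.server
```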
Sorry for the late reply, I was away for a bit.
I'm not sure what changed since then, but now I'm getting a different error message. I had already written out a reply before I noticed the change, so I've added it at the end in case it's helpful.
I now ran into the error message `(RuntimeError) bitcode module not found at ./opencl.bc`, which I was able to resolve by setting `ROCM_PATH=/opt/rocm` (mentioning this in case someone else runs into it).
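A minimal sketch of that fix, assuming a standard ROCm install location (the path may differ per distribution):

```shell
# Point the HIP runtime at the ROCm install so it can locate
# device bitcode libraries such as opencl.bc.
export ROCM_PATH=/opt/rocm

# Then restart the application in the same shell.
```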
Now I'm running into the following error, which soon afterwards causes the OS to send a SIGABRT:
```
2023-12-31 18:56:44.607676: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN
[info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[info] XLA service 0x7fe7ac1707a0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info]   StreamExecutor device (0): AMD Radeon RX 6900 XT, AMDGPU ISA version: gfx1030
[info] Using BFC allocator.
[info] XLA backend allocating 15446782771 bytes on device 0 for BFCAllocator.
...
beam.smp: /usr/src/debug/hip-runtime-amd/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:762: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
```
> do you do any Nx stuff on boot?

I've added a serving of openai/whisper to my application's supervision tree, but that should be all.

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, @whisper_model})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, @whisper_model})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, @whisper_model})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, @whisper_model})

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 4],
    chunk_num_seconds: 30,
    stream: true,
    defn_options: [compiler: EXLA]
  )
```

> Does the error happen every time? I assume it doesn't happen if you use CPU only?

Yes, it happens every time, before the serving is able to complete a single run.
Hmm, `/opt/rocm` is likely a symlink to a more specific version like `/opt/rocm-5.7.1`; let's set `ROCM_PATH` to that just to be sure. Otherwise maybe there's a certain ROCm HIP package missing in the environment?
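A sketch of that check, assuming the versioned directory naming ROCm typically uses (`5.7.1` here is just an example version):

```shell
# See what /opt/rocm actually points at:
#   readlink -f /opt/rocm    # e.g. /opt/rocm-5.7.1 on many installs

# Pin ROCM_PATH to the concrete version directory (example version):
export ROCM_PATH=/opt/rocm-5.7.1
```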
I'm running Arch Linux and rely on the packages provided there, so I'm not sure what I could be missing. I have installed every package that pops up when I search for rocm, but just to be sure I've provided a list of the installed packages below.
> Hmm, this looks like /opt/rocm is likely a symlink to a more specific version

`/opt/rocm` really just links to the packages installed on my system.

```
$ ls -lah /opt
drwxr-xr-x 34 root root 4.0K Dec 31 18:53 rocm/
```
I see. It must be something environment related, given that others managed to run it with that revision, but I don't have any more guesses right now.
One alternative would be running stuff inside Docker, though that's not exactly convenient. Or you could try building with the latest OpenXLA revision to see if it's something fixed upstream, but note that this usually requires some adjustments in the build file and/or in EXLA (depending on how much the XLA APIs changed).
We just had a new release, see https://github.com/elixir-nx/xla/issues/82#issuecomment-2124230058. You can try it with ROCm 6.0, and if there are issues, leave a comment on #82 :)
Hi, I've been trying to get GPU support running, but I keep running into this issue. I was first looking at this issue to get it running. I added the dependencies like this:
I made sure to install the dependencies mentioned in this comment (adjusted for Arch Linux):
And then tried to compile it with
```
$ XLA_BUILD=true XLA_TARGET=rocm mix compile
```
Compilation logs