Hey team, sorry for the title, I couldn't find a better one. My issue is closely related to https://github.com/elixir-nx/xla/issues/29.
I made an attempt to build and run XLA with the ROCm target. I followed the instructions in that issue and managed to build XLA for ROCm, which is great! Here are the steps I followed:
Use rocm/tensorflow:latest docker image
Use nx: 0.5.3, bumblebee: 0.3.0, exla: 0.5.2, xla: 0.4.4
Use a different TensorFlow than the one provided in the Makefile, as mentioned in the issue linked above (otherwise it won't build)
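For reference, here is roughly what those version pins look like in my mix.exs (a sketch of my setup, not a complete file; the XLA_BUILD/XLA_TARGET environment variables are the build switches from elixir-nx/xla):

```elixir
# mix.exs (fragment): the dependency pins from the steps above.
# Built with XLA_BUILD=true XLA_TARGET=rocm in the environment so that
# elixir-nx/xla compiles XLA from source for the ROCm target.
defp deps do
  [
    {:nx, "0.5.3"},
    {:bumblebee, "0.3.0"},
    {:exla, "0.5.2"},
    {:xla, "0.4.4"}
  ]
end
```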
From there I managed to compile everything I needed without problems.
Then I run my service and try to load a model with {:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"}), and I see these logs:
[info] XLA service 0x7f610c00f190 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info] StreamExecutor device (0): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (1): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (2): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (3): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (4): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (5): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (6): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (7): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] Using BFC allocator.
[info] XLA backend allocating 30908665036 bytes on device 0 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 1 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 2 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 3 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 4 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 5 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 6 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 7 for BFCAllocator.
After that I get this error and the Elixir application exits:
terminate called after throwing an instance of 'std::bad_variant_access'
what(): Unexpected index
Aborted (core dumped)
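For what it's worth, a minimal way to check whether the crash comes from the XLA/ROCm build itself, independently of Bumblebee and the Whisper model, would be something like:

```elixir
# Sanity check: run a trivial computation on the EXLA backend first.
# If this also aborts with std::bad_variant_access, the problem is in
# the compiled XLA build rather than in Bumblebee's model loading.
Nx.default_backend(EXLA.Backend)

t = Nx.tensor([1.0, 2.0, 3.0])
IO.inspect(Nx.sum(t))
```

I haven't isolated it this far yet; that's just my assumption about how to narrow it down.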
I'm not sure if this is the right place to ask, but if anyone knows how I can solve this, or a better way to do what I'm doing (maybe a specific repo/hash combination that works, or a different Docker image), that would help me a lot :)
Thank you very much, and let me know if more information is needed :)