Hey team, sorry for the title, I couldn't find a better one. My issue is closely related to https://github.com/elixir-nx/xla/issues/29.
I made an attempt to build and run XLA with the ROCm target. I followed the instructions in that issue and managed to build XLA for ROCm, which is great! Here are the steps I followed:
Use rocm/tensorflow:latest docker image
Use nx: 0.5.3, bumblebee: 0.3.0, exla: 0.5.2, xla: 0.4.4
Use a different TensorFlow than the one provided in the Makefile, as mentioned in the issue linked above (otherwise it won't build)
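For reference, here is roughly what those version pins look like in my mix.exs (a sketch of my setup, not a complete file; the XLA_BUILD/XLA_TARGET environment variables are the build switches from elixir-nx/xla):

```elixir
# mix.exs (fragment): the dependency pins from the steps above.
# Built with XLA_BUILD=true XLA_TARGET=rocm in the environment so that
# elixir-nx/xla compiles XLA from source for the ROCm target.
defp deps do
  [
    {:nx, "0.5.3"},
    {:bumblebee, "0.3.0"},
    {:exla, "0.5.2"},
    {:xla, "0.4.4"}
  ]
end
```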
From there I managed to compile everything I needed without problems.
Then I run my service and try to load a model with {:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"}), and I see these logs:
[info] XLA service 0x7f610c00f190 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info] StreamExecutor device (0): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (1): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (2): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (3): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (4): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (5): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (6): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] StreamExecutor device (7): , AMDGPU ISA version: gfx908:sramecc+:xnack-
[info] Using BFC allocator.
[info] XLA backend allocating 30908665036 bytes on device 0 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 1 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 2 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 3 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 4 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 5 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 6 for BFCAllocator.
[info] XLA backend allocating 30908665036 bytes on device 7 for BFCAllocator.
After that I get this error and the Elixir application exits:
terminate called after throwing an instance of 'std::bad_variant_access'
what(): Unexpected index
Aborted (core dumped)
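For what it's worth, a minimal way to check whether the crash comes from the XLA/ROCm build itself, independently of Bumblebee and the Whisper model, would be something like:

```elixir
# Sanity check: run a trivial computation on the EXLA backend first.
# If this also aborts with std::bad_variant_access, the problem is in
# the compiled XLA build rather than in Bumblebee's model loading.
Nx.default_backend(EXLA.Backend)

t = Nx.tensor([1.0, 2.0, 3.0])
IO.inspect(Nx.sum(t))
```

I haven't isolated it this far yet; that's just my assumption about how to narrow it down.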
I'm not sure if this is the right place to ask, but if anyone knows how I can solve this, or a better way to do what I'm doing (maybe a specific repo/hash combination that works, or a different Docker image), that would help me a lot :)
Thank you very much, and let me know if more information is needed :)