Closed: brainlid closed this issue 11 months ago.
I'm experimenting with getting Bumblebee to work on a server with a GPU.
Is there a way I can easily tell if Bumblebee has access to the GPU? Currently, I can only tell by trying to do something with it and waiting minutes, which tells me it's using the CPU.
Are there any special ENV vars I should be using to enable it? Here's what I've got:
Any tips are appreciated!
Run Nx.tensor(0). If it says the tensor is allocated on the GPU, everything else will be allocated on the GPU as well; if it is on the CPU, so is everything else. Note that you will need to set Nx.default_backend(EXLA.Backend) as well.
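For reference, here is roughly what the check looks like on a machine where EXLA does see the GPU (a sketch; the ref numbers in the inspect output are illustrative):

> Nx.default_backend(EXLA.Backend)
> Nx.tensor(0)
#Nx.Tensor<
  s64
  EXLA.Backend<cuda:0, 0.1234567890.1234567890.12345>
  0
>

If the backend line reads host:0 instead of cuda:0, the tensor was allocated on the CPU.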
How should I interpret this?
> Nx.tensor(0)
#Nx.Tensor<
  s64
  EXLA.Backend<host:0, 0.3195620623.3568173077.121122>
  0
>
"host" means CPU in this case. It seems gpu is not detected or available. What happens on EXLA.Client.fetch!(:cuda)
?
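A related check (assuming exla ~> 0.6; the device counts below are illustrative) is EXLA.Client.get_supported_platforms(), which lists every platform the installed XLA binary can see:

> EXLA.Client.get_supported_platforms()
%{host: 8, interpreter: 1}

If :cuda is missing from that map, the CUDA-enabled XLA build was never installed.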
Here's the result:
> EXLA.Client.fetch!(:cuda)
** (exit) exited in: GenServer.call(EXLA.Client, {:client, :cuda, [platform: :cuda]}, :infinity)
    ** (EXIT) an exception was raised:
        ** (RuntimeError) Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
            (exla 0.6.1) lib/exla/client.ex:195: EXLA.Client.unwrap!/1
            (exla 0.6.1) lib/exla/client.ex:176: EXLA.Client.build_client/2
            (exla 0.6.1) lib/exla/client.ex:136: EXLA.Client.handle_call/3
            (stdlib 4.3) gen_server.erl:1149: :gen_server.try_handle_call/4
            (stdlib 4.3) gen_server.erl:1178: :gen_server.handle_msg/6
            (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
    (elixir 1.14.3) lib/gen_server.ex:1038: GenServer.call/3
    iex:5: (file)
So it's not seeing it. :thinking:
I've been using Chris' Livebook gist as a reference: https://gist.github.com/chrismccord/59a5e81f144a4dfb4bf0a8c3f2673131
And the VM does have access to the GPU:
root@732875413f7859:/app# nvidia-smi
Tue Nov  7 21:19:37 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P0              36W / 250W |      4MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Can't find where the mismatch is.
Thanks for helping me identify that it can't access the GPU!
Try following the steps here: https://github.com/elixir-nx/bumblebee/issues/266#issue-1959822664
Also, make sure XLA_TARGET is set before you install any deps.
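In a Dockerfile that means setting the ENV before the deps are fetched and compiled. A minimal sketch, assuming CUDA 12 (per the nvidia-smi output above) and xla ~> 0.6, for which cuda120 is the matching XLA_TARGET value:

# Must come before mix deps.get / mix deps.compile so the
# CUDA-enabled XLA binary is fetched instead of the CPU one.
ENV XLA_TARGET=cuda120

COPY mix.exs mix.lock ./
RUN mix deps.get && mix deps.compile

If the deps were already compiled with the CPU target, you may also need mix deps.clean xla exla before rebuilding.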
I needed to move the ENV for XLA_TARGET up in the Dockerfile. That was helpful! Thanks!