elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Way to tell if Bumblebee or Nx or whatever has access to a GPU? #272

Closed brainlid closed 11 months ago

brainlid commented 11 months ago

I'm experimenting with getting Bumblebee to work on a server with a GPU.

Is there a way I can easily tell if Bumblebee has access to the GPU? Currently, the only way I can tell is to run something: when it takes minutes, I know it's using the CPU.

Are there any special ENV variables I should set to enable it? Here's what I've got:

ENV XLA_TARGET="cuda120"
ENV BUMBLEBEE_CACHE_DIR="/data/cache/bumblebee"
ENV XLA_CACHE_DIR="/data/cache/xla"

Any tips are appreciated!

josevalim commented 11 months ago

Run Nx.tensor(0). If it says it is allocated on the GPU, everything else should be allocated on the GPU as well. Note that you will also need to set Nx.default_backend(EXLA.Backend).
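
Nx.default_backend/1 only changes the default for the current process, so for a whole application the backend is usually set in config instead. A minimal sketch, assuming a standard Mix project with :nx and :exla already in deps:

```elixir
# config/config.exs
import Config

# Make EXLA the default backend for every process in the app,
# instead of calling Nx.default_backend/1 in each one.
config :nx, default_backend: EXLA.Backend
```

With either approach, the backend line printed by Nx.tensor(0) is what shows where the tensor was allocated.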

brainlid commented 11 months ago

How should I interpret this?

> Nx.tensor(0)
#Nx.Tensor<
  s64
  EXLA.Backend<host:0, 0.3195620623.3568173077.121122>
  0
>

josevalim commented 11 months ago

"host" means CPU in this case. It seems the GPU is not detected or not available. What happens when you run EXLA.Client.fetch!(:cuda)?

brainlid commented 11 months ago

Here's the result:

> EXLA.Client.fetch!(:cuda)
** (exit) exited in: GenServer.call(EXLA.Client, {:client, :cuda, [platform: :cuda]}, :infinity)
    ** (EXIT) an exception was raised:
        ** (RuntimeError) Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
            (exla 0.6.1) lib/exla/client.ex:195: EXLA.Client.unwrap!/1
            (exla 0.6.1) lib/exla/client.ex:176: EXLA.Client.build_client/2
            (exla 0.6.1) lib/exla/client.ex:136: EXLA.Client.handle_call/3
            (stdlib 4.3) gen_server.erl:1149: :gen_server.try_handle_call/4
            (stdlib 4.3) gen_server.erl:1178: :gen_server.handle_msg/6
            (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
    (elixir 1.14.3) lib/gen_server.ex:1038: GenServer.call/3
    iex:5: (file)

So it's not seeing it. 🤔
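
Note that the failure above is an exit from GenServer.call, not a plain raised exception, so a rescue alone would not catch it. Here is a minimal sketch of a check that handles both, assuming EXLA is in your deps. GpuCheck and cuda_status/1 are hypothetical names, not part of EXLA, and the fetch function is injectable only so the logic can be exercised on a machine without a GPU:

```elixir
defmodule GpuCheck do
  # Hypothetical helper for illustration; not part of EXLA or Bumblebee.
  # By default it calls EXLA.Client.fetch!/1, the same call used in this thread.
  def cuda_status(fetch \\ &EXLA.Client.fetch!/1) do
    _client = fetch.(:cuda)
    :ok
  rescue
    # A regular exception raised in the calling process.
    error -> {:error, Exception.message(error)}
  catch
    # The client process crashed, so the GenServer call exits
    # (this is the shape of the failure shown above).
    :exit, reason -> {:error, "exit: " <> inspect(reason)}
  end
end
```

Calling GpuCheck.cuda_status() in iex should then return :ok when the CUDA client can be built, or an {:error, message} tuple instead of crashing the shell.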

brainlid commented 11 months ago

I've been using Chris' Livebook gist as a reference: https://gist.github.com/chrismccord/59a5e81f144a4dfb4bf0a8c3f2673131

And the VM does have access to the GPU:

root@732875413f7859:/app# nvidia-smi
Tue Nov  7 21:19:37 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P0              36W / 250W |      4MiB / 40960MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Can't find where the mismatch is.

brainlid commented 11 months ago

Thanks for helping me identify that it can't access the GPU!

josevalim commented 11 months ago

Try following the steps here: https://github.com/elixir-nx/bumblebee/issues/266#issue-1959822664

Also make sure XLA_TARGET is set before you install any deps.
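
To catch this ordering problem early, you could run a small guard before fetching or compiling deps. This is only a sketch: XlaTargetCheck is a hypothetical module, and treating any value that starts with "cuda" as a GPU target is an assumption based on the cuda120 value used in this thread.

```elixir
defmodule XlaTargetCheck do
  # Hypothetical guard for illustration; not part of XLA, EXLA, or Bumblebee.
  # Takes the environment as a map so it can be checked without touching
  # the real environment; defaults to the current process environment.
  def check(env \\ System.get_env()) do
    case Map.get(env, "XLA_TARGET") do
      nil ->
        {:error, "XLA_TARGET is not set; set it before installing deps"}

      # Assumption: CUDA targets are named "cuda..." (e.g. "cuda120").
      "cuda" <> _rest = target ->
        {:ok, target}

      other ->
        {:error, "XLA_TARGET=#{other} is not a CUDA target"}
    end
  end
end
```

If XlaTargetCheck.check() returns an error at the point in the build where deps are installed, the variable is being set too late or not at all.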

brainlid commented 11 months ago

I needed to move the ENV for XLA_TARGET up in the Dockerfile. That was helpful! Thanks!