exo-explore / exo

Run your own AI cluster at home with everyday devices πŸ“±πŸ’» πŸ–₯️⌚
GNU General Public License v3.0

Exo not detecting Nvidia GPUs #192

Open sammcj opened 2 months ago

sammcj commented 2 months ago

This is a follow up from https://github.com/exo-explore/exo/issues/46 which wasn't resolved.

My Linux machine has:

But Exo thinks they only provide a total of 0 TFLOPS.

Here you can see the only TFLOPS added to the pool are from my macbook pro:

image

exo from commit dc3b2bde39a5d9b59806bc5970a8fb7fe51b2c75 (Date: 2024-08-30 12:28:24 +0100)

vectrocomputers commented 2 months ago

Screenshot_20240902_212605-1: the left terminal window is Garuda Linux (Arch) with an RTX 3080; the right terminal is Ubuntu with a 1080 Ti. Both computers have the correct drivers and cuda-toolkit installed. When I start them, they show the GPU is detected but 0 TFLOPS.

Another thing: the computer with the 1080 Ti also has a 2060, but exo only sees the 1080.

peaster commented 2 months ago

What command are you using to start the main.py on your Arch machine? I have a machine running Arch with a 3090 and a 2060. Using CUDA=1 python3 main.py, exo sees the 3090. When I add a Mac to the network, no inference will take place

sammcj commented 2 months ago

Oooo, I didn't realise you had to manually specify CUDA=1 (how odd!). With that it sees one of my three GPUs, but I think that TFLOPS calculation must be for all 3?

image
sammcj commented 2 months ago

When I try to run any inference with it, it doesn't work though. The model downloads, GPU usage on the MacBook temporarily spikes, nothing happens on the CUDA hosts, and the chat interface sits on 'generating' forever:

image image
vectrocomputers commented 2 months ago

> What command are you using to start the main.py on your Arch machine? I have a machine running Arch with a 3090 and a 2060. Using CUDA=1 python3 main.py, exo sees the 3090. When I add a Mac to the network, no inference will take place

Works! It didn't say that in the README file. I'm now going to open an issue and annoy Alex Cheema on X 😂

Would be nice to have multiple GPU support.

vectrocomputers commented 2 months ago

I wonder if multiple GPU on one computer would work by virtualizing each GPU and running exo separately on each one.
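A lighter-weight version of that idea, assuming exo/tinygrad honors the standard CUDA_VISIBLE_DEVICES runtime variable (an assumption, not something verified here), would be to launch one exo process per GPU with device masking rather than full virtualization. A minimal sketch:

```shell
# Hypothetical sketch: one exo process per GPU via device masking.
# CUDA_VISIBLE_DEVICES is a standard CUDA runtime variable; whether
# exo/tinygrad respects it in this setup is untested, e.g.:
#
#   CUDA=1 CUDA_VISIBLE_DEVICES=0 python3 main.py &   # sees only GPU 0
#   CUDA=1 CUDA_VISIBLE_DEVICES=1 python3 main.py &   # sees only GPU 1
#
# The env-prefix form scopes the mask to each child process, so each
# instance enumerates only its assigned device:
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "this process sees GPU(s): $CUDA_VISIBLE_DEVICES"'
```

Each instance would then join the cluster as a separate node, which sidesteps the single-GPU detection limit at the cost of extra networking overhead between processes on the same box.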

peaster commented 2 months ago

I'm most definitely seeing the same thing as you @sammcj. GPU spikes on the Mac host, nothing happens on the CUDA host other than a compute process being started on the 3090 with very little VRAM assigned.

I found CUDA=1 in this documentation https://docs.tinygrad.org/env_vars/ and even tried applying METAL=1 to main.py on my Mac. Maybe some other variables will help solve this.
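For reference, the backend-selection variables from that tinygrad page are read from the environment at process start, so they go as a prefix on the launch command. DEBUG is also a documented tinygrad variable that makes it log device info, which may help show which backend actually got picked (treat the exact output as version-dependent):

```shell
# Backend selection per https://docs.tinygrad.org/env_vars/, e.g.:
#
#   CUDA=1 python3 main.py            # force the CUDA backend (Linux/NVIDIA)
#   METAL=1 python3 main.py           # force the Metal backend (macOS)
#   DEBUG=2 CUDA=1 python3 main.py    # DEBUG adds device/kernel logging
#
# The prefix form sets the variable only for that one child process:
CUDA=1 sh -c 'echo "child sees CUDA=$CUDA"'
echo "parent still sees CUDA=${CUDA:-unset}"
```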

lipere123 commented 1 month ago

Hello.

I am on Ubuntu 22.04, 6 nodes with one NVIDIA RTX 4000 ADA each. exo is at 0 TFLOPS.

I have tried everything in the book / docs / issues; no change.

Thanks in advance. Best Regards. Benjamin.

sammcj commented 1 month ago

Yeah, I'm assuming @exo-explore isn't testing against any nvidia cards.

lipere123 commented 1 month ago

Is there a way to convert the code to PyTorch? Other similar projects detect the GPU, but they use PyTorch. Petals is a very good example. Also, we should aim for Llama 3.2 with vision, or the code will become obsolete quickly. Best Regards. Benjamin.


larson-carter commented 1 month ago

Hmm. This is a new one that I haven't seen deployed yet.

1x RTX 3090 (24GB) 2x RTX A4000 (2x16GB)

> Oooo, I didn't realise you had to manually specify CUDA=1 (how odd!). With that it sees one of my three GPUs, but I think that TFLOPS calculation must be for all 3?

This might be something easy to fix/implement. Could you possibly try sending in some requests, just to see which GPU(s) actually get used?

vectrocomputers commented 1 month ago

In a different thread, @AlexCheema mentioned I probably didn't have CUDA installed, and that's why I couldn't get it to recognize the card and had to specify CUDA=1. He was right. I broke down my home AI rig, rebuilt it, and then totally derped out and forgot to install CUDA 🤦‍♂️. On Ubuntu I had to do:

sudo apt update
sudo apt install nvidia-cuda-toolkit
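After installing, it's worth sanity-checking that both the driver and the toolkit are actually visible before blaming exo. These are standard NVIDIA tools, nothing exo-specific; each check degrades to a hint rather than an error if the tool is missing:

```shell
# Driver side: nvidia-smi -L lists every GPU the driver can see.
command -v nvidia-smi >/dev/null && nvidia-smi -L \
  || echo "nvidia-smi not on PATH - driver not installed?"

# Toolkit side: nvcc reports the installed CUDA release.
command -v nvcc >/dev/null && nvcc --version \
  || echo "nvcc not on PATH - cuda toolkit not installed?"
```

If nvidia-smi lists the cards but nvcc is missing, that matches the failure mode above: the driver works, but tinygrad's CUDA backend has no toolkit to compile against.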

lipere123 commented 1 month ago

The main problem with apt-get is getting the versions right. I originally wanted to run CUDA in an LXD 24.04 container, but for some applications I have tested it is too early, so I had to use 22.04 in LXD with a 24.04 host. I can't go into the details right now in a short mail, but my whole Ansible setup is lengthy, with applications like PyTorch inside LXD and a Python environment as a result. CUDA is reinstalled each time on the Python side with pip. ++


lipere123 commented 1 month ago

PyTorch tests still run fine, for example.


sammcj commented 6 days ago

FYI this is still an issue as of 2024-11-12.

β”‚                                           Web Chat URL (tinychat): http://127.0.0.1:8000                                                                                                   β”‚
β”‚                                  ChatGPT API endpoint: http://127.0.0.1:8000/v1/chat/completions                                                                                           β”‚
β”‚                          GPU poor   β–Ό                                                            GPU rich                                                                                  β”‚
β”‚                                   [πŸŸ₯πŸŸ₯πŸŸ₯πŸŸ₯πŸŸ₯πŸŸ₯πŸŸ₯πŸŸ₯🟧🟧🟧🟧🟧🟧🟧🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩]                                                                                           β”‚
β”‚                                0.00 TFLOPS                                                                                                                                                 β”‚
β”‚                                     β–²                                                                                                                                                      β”‚
β”‚
nvidia-smi
Tue Nov 12 14:38:54 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
| 31%   32C    P8             13W /  310W |   15626MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:16:00.0 Off |                  N/A |
| 30%   26C    P8             11W /  290W |   15344MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Manually exporting CUDA=1 works around the issue.
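Until auto-detection works, the workaround can be made persistent so every exo launch picks it up. A minimal sketch, assuming an interactive bash shell (adjust the profile file for zsh etc.):

```shell
# Per-invocation workaround, as noted above:
#   CUDA=1 python3 main.py
#
# Or export once for the whole shell session:
export CUDA=1
echo "CUDA=$CUDA"

# Or persist it in the bash profile (idempotent append; path is an
# assumption about your shell setup):
grep -qx 'export CUDA=1' "$HOME/.bashrc" 2>/dev/null \
  || echo 'export CUDA=1' >> "$HOME/.bashrc"
```

Note this still only makes exo see one GPU per host; multi-GPU detection remains open.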