Open sammcj opened 2 months ago
The left terminal window is Garuda Linux (Arch) with an RTX 3080. The right terminal is Ubuntu with a 1080 Ti. Both computers have the correct drivers and cuda-toolkit installed. When I start them, they show the GPU is detected, but at 0 TFLOPS.
The other thing is that the computer with the 1080 Ti also has a 2060, but exo only sees the 1080.
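A quick way to check which backend tinygrad (which exo runs on) actually picks up is to print its default device. A minimal sketch, assuming the tinygrad version exo pins exposes Device at the top level:
python3 -c "from tinygrad import Device; print(Device.DEFAULT)"   # e.g. CUDA, or CLANG if it fell back to CPU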
What command are you using to start the main.py on your Arch machine?
I have a machine running Arch with a 3090 and a 2060. Using CUDA=1 python3 main.py, exo sees the 3090. When I add a Mac to the network, no inference will take place.
Oooo, I didn't realise you had to manually specify CUDA=1 (how odd!). With that it sees one of my three GPUs, but I think that TFLOPS calculation must be for all 3?
When you try to run any inference with it, though, it doesn't work. The model downloads, GPU usage on the MacBook temporarily spikes, nothing happens on the CUDA hosts, and the chat interface sits on 'generating' forever:
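One way to see where it stalls is to raise tinygrad's log level (DEBUG is one of the documented tinygrad env vars; higher values are noisier):
CUDA=1 DEBUG=2 python3 main.py   # prints kernel/device activity per step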
> What command are you using to start the main.py on your Arch machine? I have a machine running Arch with a 3090 and a 2060. Using CUDA=1 python3 main.py, exo sees the 3090. When I add a Mac to the network, no inference will take place.
Works! It didn't say that in the README file. I'm now going to open an issue and annoy Alex Cheema on X.
Would be nice to have multiple GPU support.
I wonder if multiple GPUs on one computer would work by virtualizing each GPU and running a separate exo instance on each one.
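Short of full virtualization, something similar might be approximated by pinning each instance to one card with CUDA_VISIBLE_DEVICES. A rough, untested sketch that assumes exo tolerates two instances on the same host (ports etc.):
CUDA=1 CUDA_VISIBLE_DEVICES=0 python3 main.py &   # instance 1 sees only GPU 0
CUDA=1 CUDA_VISIBLE_DEVICES=1 python3 main.py &   # instance 2 sees only GPU 1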
I'm most definitely seeing the same thing as you @sammcj. GPU spikes on the Mac host, nothing happens on the CUDA host other than a compute process being started on the 3090 with very little VRAM assigned.
I found CUDA=1 in this documentation https://docs.tinygrad.org/env_vars/ and even tried applying METAL=1 to main.py on my Mac. Maybe some other variables will help solve this.
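Per those docs the backend is forced per process, so across the cluster it would look something like this (METAL is normally auto-detected on a Mac, so the second line may be redundant):
CUDA=1 python3 main.py    # on the Linux/NVIDIA hosts
METAL=1 python3 main.py   # on the Mac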
Hello.
I am on Ubuntu 22.04, 6 nodes with one NVIDIA RTX 4000 Ada each. exo is at 0 TFLOPS.
I have tried everything in the book / docs / issues; no change.
Thanks in advance. Best Regards. Benjamin.
Yeah, I'm assuming @exo-explore isn't testing against any nvidia cards.
Is there a way to convert the code to PyTorch? Other similar projects detect the GPU, but they use PyTorch; Petals is a very good example. Also, we should aim for Llama 3.2 with vision, or the code will be obsolete quickly. Best Regards. Benjamin.
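For comparison, PyTorch's detection can be checked with a one-liner on any node that has torch installed; if this prints True and a non-zero count, the driver/CUDA stack is fine and the problem is on the tinygrad side:
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"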
Hmm. This is a new one that I haven't seen deployed yet.
1x RTX 3090 (24GB), 2x RTX A4000 (2x16GB)
> Oooo, I didn't realise you had to manually specify CUDA=1 (how odd!). With that it sees one of my three GPUs, but I think that TFLOPS calculation must be for all 3?
This might be something easy to fix/implement. Could you try sending in a request, just to see which GPU(s) actually get used?
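A quick way to do that is to fire a request at exo's ChatGPT-compatible endpoint (the one it prints at startup) while watching nvidia-smi on each host. The model name below is just a placeholder; use whatever model your exo build lists:
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "hi"}]}'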
In a different thread, @AlexCheema mentioned I probably didn't have CUDA installed, and that's why I couldn't get it to recognize the card and had to specify CUDA=1. He was right. I broke down my home AI rig, rebuilt it, and then totally derped out and forgot to install CUDA 🤦‍♂️. On Ubuntu I had to do:
sudo apt update
sudo apt install nvidia-cuda-toolkit
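After that, it's worth verifying both halves of the stack before starting exo:
nvcc --version   # confirms the CUDA toolkit is on the PATH
nvidia-smi       # confirms the driver sees the card(s)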
The main problem with apt-get is getting the versions right. I originally wanted to run CUDA on LXD 24.04, but for some applications that I have tested it is too early, so I had to use 22.04 in LXD with a host on 24.04. I can't go into the details right now in a short mail, but my whole Ansible setup is lengthy, with applications like PyTorch inside LXD and a Python environment as a result. CUDA is reinstalled each time on the Python side with pip. ++
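For reference, the LXD side of such a setup might look like the sketch below (the container name is made up; the gpu device passes the host GPUs through to the container):
lxc launch ubuntu:22.04 cuda2204        # hypothetical container name
lxc config device add cuda2204 gpu gpu  # expose the host GPU(s) inside it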
PyTorch tests still run fine, for example.
FYI this is still an issue as of 2024-11-12.
Web Chat URL (tinychat): http://127.0.0.1:8000
ChatGPT API endpoint: http://127.0.0.1:8000/v1/chat/completions
GPU poor ▼ GPU rich
[🟥🟥🟥🟥🟥🟥🟥🟥🟧🟧🟧🟧🟧🟧🟧🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩]
0.00 TFLOPS
▲
nvidia-smi
Tue Nov 12 14:38:54 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 31% 32C P8 13W / 310W | 15626MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:16:00.0 Off | N/A |
| 30% 26C P8 11W / 290W | 15344MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Manually exporting CUDA=1 works around the issue.
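That is:
export CUDA=1
python3 main.py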
This is a follow up from https://github.com/exo-explore/exo/issues/46 which wasn't resolved.
My Linux machine has:
1x RTX 3090 (24GB)
2x RTX A4000 (2x16GB)
But exo thinks they only provide a total of 0 TFLOPS.
Here you can see that the only TFLOPS added to the pool are from my MacBook Pro:
exo from commit dc3b2bde39a5d9b59806bc5970a8fb7fe51b2c75 (Date: 2024-08-30 12:28:24 +0100)