exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

Unable to Display TFLOPS on Ubuntu Server and Node Connectivity Issues #178

Closed berry-13 closed 2 weeks ago

berry-13 commented 2 weeks ago

Issue Summary:

I've installed Exo on two systems:

  1. Ubuntu 22.04 server with an RTX 2060 (12GB VRAM) and 32GB RAM
  2. PC with a 4070 (12GB VRAM) and 64GB RAM running on WSL2 (due to code compatibility issues with Windows)

Steps Taken:

Issues Encountered:

  1. On my Ubuntu server, the TFLOPS of the RTX 2060 is not displayed. Both systems correctly detect the GPU model and VRAM, but this specific detail is missing on the server.
  2. I haven't found a way to connect these two nodes despite them being on the same network. There’s no firewall (Windows firewall is disabled).
  3. I attempted to run the llama3.1 7B model on the PC, but it saturated both the RAM and VRAM. This seems unusual for a 7B model, which might indicate that it's not quantized, or possibly a WSL2 limitation.

My Questions:

AlexCheema commented 2 weeks ago

  1. The TFLOPS is just a visual bug. I fixed this here: https://github.com/exo-explore/exo/commit/62e372626316a9333cbd9c43aee67ad704068757
  2. They find each other by broadcasting on all interfaces over UDP. If they're not finding each other over the same WiFi network, that's unusual. You could also try connecting them physically.
  3. It's unquantized fp16, so would be ~16GB for llama-3.1-8b. Should fit across your two devices once you get them to discover each other.
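
The ~16GB figure in point 3 follows from the parameter count times the bytes per parameter. A minimal sketch of that arithmetic, assuming weights only (no KV cache, activations, or framework overhead) and using the 12GB VRAM figures from the original post; the function name is hypothetical and not part of exo:

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


# fp16 stores each parameter in 2 bytes.
fp16_size = weights_gb(8, 2)

# VRAM figures reported in the original post: 12GB on each machine.
single_gpu_vram = 12
combined_vram = 12 + 12

print(f"llama-3.1-8b fp16 weights: ~{fp16_size:.0f} GB")
print(f"fits on one 12GB GPU: {fp16_size <= single_gpu_vram}")
print(f"fits across both GPUs: {fp16_size <= combined_vram}")
```

This is only a lower bound on real memory use, but it is consistent with the model overflowing a single 12GB card while fitting across two.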

berry-13 commented 2 weeks ago

@AlexCheema thanks for the fix! Unfortunately, I can't connect them physically, but they are both on a 10GBIT LAN connection. Are there any other ways to troubleshoot this?

AlexCheema commented 2 weeks ago

> @AlexCheema thanks for the fix! Unfortunately, I can't connect them physically, but they are both on a 10GBIT LAN connection. Are there any other ways to troubleshoot this?

Can they reach each other e.g. with ping?

xinchi-he commented 2 weeks ago

Does it support CPU-only machines? I am trying to set up two VM instances on GCP to try exo out. I started exo with python3 main.py, but the two nodes cannot find each other. The two nodes can ping each other, and I can use nc to send a test UDP packet to port 5678. I also set DEBUG=9 as the OP said, but the log keeps saying it can find 0 peers. Is there anything I should look into? Thanks!
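
Since discovery is described above as a UDP broadcast, a unicast nc test may pass while broadcast traffic is still being dropped. The following is a rough sketch of a standalone broadcast probe, not exo's actual discovery code: the payload and function names are hypothetical, and port 5678 is simply the port mentioned in the comment above.

```python
import socket
from typing import Optional

PROBE = b"discovery-test"  # hypothetical payload, not exo's real message format


def open_listener(port: int, timeout: float = 2.0) -> socket.socket:
    """Bind a UDP socket on all interfaces and wait up to `timeout` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.settimeout(timeout)
    return sock


def send_probe(port: int, address: str = "255.255.255.255") -> None:
    """Send one probe datagram; the default address is the limited broadcast."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(PROBE, (address, port))


def wait_for_probe(sock: socket.socket) -> Optional[str]:
    """Return the sender's IP if a probe arrives before the timeout, else None."""
    try:
        data, (host, _port) = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return host if data == PROBE else None
```

To use it, call `open_listener(5678)` followed by `wait_for_probe(...)` on one machine (with a longer timeout) and `send_probe(5678)` on the other. If a probe sent to the peer's IP arrives but one sent to the broadcast address does not, the network path (for example a virtualized or NAT'd one) may be dropping broadcasts.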

berry-13 commented 2 weeks ago

> Can they reach each other e.g. with ping?

I think that this connection issue is likely related to WSL. Is there a way to get this working on Windows without using WSL?

Here are the logs from Windows:

Traceback (most recent call last):
  File "D:\exo\main.py", line 184, in <module>
    loop.run_until_complete(main())
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "D:\exo\main.py", line 170, in main
    loop.add_signal_handler(s, handle_exit)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\events.py", line 553, in add_signal_handler
    raise NotImplementedError
NotImplementedError

It crashes immediately after I start it.

berry-13 commented 2 weeks ago

@AlexCheema closing this as the main request is more clearly explained in #184

The main issue was due to WSL blocking local IP access. I made some code modifications to enable it to start on Windows, and it successfully began connecting the two nodes. I'll wait for Windows support with llama.cpp.

AlexCheema commented 2 weeks ago

> @AlexCheema closing this as the main request is more clearly explained in #184
>
> The main issue was due to WSL blocking local IP access. I made some code modifications to enable it to start on Windows, and it successfully began connecting the two nodes. I'll wait for Windows support with llama.cpp.

Do you mind pushing the code changes somewhere for the network fixes on windows?

berry-13 commented 2 weeks ago

> Do you mind pushing the code changes somewhere for the network fixes on windows?

Not many changes. To make sure that it won't kill itself, I removed these two lines:

  for s in [signal.SIGINT, signal.SIGTERM]:
    loop.add_signal_handler(s, handle_exit)

and added this one:

  signal.signal(signal.SIGINT, lambda s, f: handle_exit())
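
The traceback earlier in the thread ends in NotImplementedError because asyncio's `loop.add_signal_handler` is only available on Unix. A possible way to keep both behaviours in one code path is to try the loop-based handler first and fall back to `signal.signal` when the platform does not support it. This is a sketch under that assumption, not the change that was merged into exo; `install_exit_handler` is a hypothetical helper, and `handle_exit` is assumed to take no arguments, as in the snippets above.

```python
import signal


def install_exit_handler(loop, handle_exit) -> str:
    """Register handle_exit for shutdown signals.

    Returns "loop" if the event loop accepted the handlers, or "signal" if
    the platform (e.g. Windows) required the signal.signal fallback.
    """
    try:
        for s in (signal.SIGINT, signal.SIGTERM):
            loop.add_signal_handler(s, handle_exit)
        return "loop"
    except NotImplementedError:
        # Fallback mirrors the workaround in this thread: only SIGINT is
        # handled, and the (signum, frame) arguments are discarded.
        signal.signal(signal.SIGINT, lambda signum, frame: handle_exit())
        return "signal"
```

In `main()`, the two removed lines would then become a single call such as `install_exit_handler(loop, handle_exit)`. Note that both registration paths must run in the main thread.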