Unable to Display TFLOPS on Ubuntu Server and Node Connectivity Issues

berry-13 commented 2 weeks ago

Issue Summary:

I've installed Exo on two systems:

Ubuntu 22.04 server with an RTX 2060 (12GB VRAM) and 32GB RAM
PC with a 4070 (12GB VRAM) and 64GB RAM running on WSL2 (due to code compatibility issues with Windows)

Steps Taken:

Tried starting the application with python3 main.py and also with debugging enabled: DEBUG_DISCOVERY=9 DEBUG=9 python3 main.py --inference-engine tinygrad
Verified that both the API endpoint and chat URL are accessible and listening on their respective IPs
Uninstalled all NVIDIA drivers and reinstalled the latest NVIDIA drivers and CUDA on both machines

Issues Encountered:

1. On my Ubuntu server, the TFLOPS of the RTX 2060 is not displayed. Both systems correctly detect the GPU model and VRAM, but this specific detail is missing on the server.
2. I haven't found a way to connect these two nodes despite them being on the same network. There's no firewall (Windows firewall is disabled).
3. Attempted to run the llama3.1 8B model on the PC, but it saturated both the RAM and VRAM. This seems unusual for an 8B model, which might indicate it's not quantized or possibly a WSL2 limitation.

My Questions:

1. Why isn't the TFLOPS showing for the Ubuntu server's RTX 2060?
2. How can I connect the two nodes on the same network?
3. Any insights on the RAM/VRAM saturation with the llama3.1 8B model on WSL2?

AlexCheema commented 2 weeks ago

1. The TFLOPS is just a visual bug. I fixed this here: https://github.com/exo-explore/exo/commit/62e372626316a9333cbd9c43aee67ad704068757
2. The nodes find each other by broadcasting on all interfaces over UDP. If they're not finding each other over the same WiFi network, that's unusual. You could also try connecting them physically.
3. It's unquantized fp16, so it would be ~16GB for llama-3.1-8b. It should fit across your two devices once you get them to discover each other.
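
The ~16GB figure follows from parameter count times bytes per weight. A rough sketch of that arithmetic, assuming the weights dominate memory use and ignoring the KV cache, activations, and framework overhead (the function names are illustrative and not part of exo):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_param


def fits(required_gb: float, device_vram_gb: list[float]) -> bool:
    """True if the combined VRAM of the devices covers the required size."""
    return sum(device_vram_gb) >= required_gb


fp16_gb = weight_memory_gb(8, 2)   # llama-3.1-8b at 2 bytes per weight (fp16)
print(fp16_gb)
print(fits(fp16_gb, [12]))         # a single 12GB card
print(fits(fp16_gb, [12, 12]))     # both 12GB cards together
```

With these numbers the weights alone exceed one 12GB card but fit within the combined 24GB, which is consistent with the model only becoming usable once both nodes are connected.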

berry-13 commented 2 weeks ago

@AlexCheema thanks for the fix! Unfortunately, I can't connect them physically, but they are both on a 10 Gbit LAN connection. Are there any other ways to troubleshoot this?

AlexCheema commented 2 weeks ago

> @AlexCheema thanks for the fix! Unfortunately, I can't connect them physically, but they are both on a 10 Gbit LAN connection. Are there any other ways to troubleshoot this?

Can they reach each other e.g. with ping?

xinchi-he commented 2 weeks ago

Does it support CPU only? I am trying to set up 2 VM instances on GCP to try exo out. I started exo with python3 main.py, but the 2 nodes cannot find each other. The two nodes can ping each other, and I can use nc to send a test UDP packet to port 5678. I also used DEBUG=9 as the OP did, but the log keeps saying it found 0 peers. Is there anything I should look into? Thanks!
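
An nc test to a specific IP only exercises unicast UDP, whereas discovery (per the reply above) relies on broadcast. A minimal sketch for checking whether broadcast datagrams actually arrive, assuming a hypothetical test port of 50000 and helper names that are not part of exo:

```python
import socket


def open_listener(port, host=""):
    """Bind a UDP socket; an empty host listens on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def send_probe(port, address="255.255.255.255", payload=b"probe"):
    """Send one datagram; the default address is the limited broadcast address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # SO_BROADCAST must be enabled before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, (address, port))


def wait_for_probe(sock, timeout=5.0):
    """Return (data, sender) for the first datagram, or None on timeout."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None
```

On one node, call wait_for_probe(open_listener(50000)); on the other, call send_probe(50000). If a probe sent to the listener's own IP arrives but the broadcast probe times out, the network is probably dropping broadcast traffic, which may be the case on some cloud virtual networks or behind WSL2's NAT networking.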

berry-13 commented 2 weeks ago

> Can they reach each other e.g. with ping?

I think that this connection issue is likely related to WSL. Is there a way to get this working on Windows without using WSL?

Here are the logs from Windows:

Traceback (most recent call last):
  File "D:\exo\main.py", line 184, in <module>
    loop.run_until_complete(main())
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "D:\exo\main.py", line 170, in main
    loop.add_signal_handler(s, handle_exit)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\events.py", line 553, in add_signal_handler
    raise NotImplementedError
NotImplementedError

It crashes immediately after I start it

berry-13 commented 2 weeks ago

@AlexCheema closing this as the main request is more clearly explained in #184

The main issue was WSL blocking local IP access. I made some code modifications so it can start on Windows, and it successfully began connecting the two nodes. I'll wait for Windows support with llama.cpp.

AlexCheema commented 2 weeks ago

> @AlexCheema closing this as the main request is more clearly explained in #184
>
> The main issue was WSL blocking local IP access. I made some code modifications so it can start on Windows, and it successfully began connecting the two nodes. I'll wait for Windows support with llama.cpp.

Do you mind pushing the code changes somewhere for the network fixes on Windows?

berry-13 commented 2 weeks ago

> Do you mind pushing the code changes somewhere for the network fixes on Windows?

Not many changes. To make sure it won't kill itself on startup, I removed these two lines:

  for s in [signal.SIGINT, signal.SIGTERM]:
    loop.add_signal_handler(s, handle_exit)

and added this one:

signal.signal(signal.SIGINT, lambda s, f: handle_exit())
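
The traceback earlier in the thread comes from loop.add_signal_handler, which asyncio does not implement on Windows event loops. A minimal sketch of a variant that keeps the original behaviour on Unix and falls back to signal.signal elsewhere; install_exit_handlers is a hypothetical helper, and handle_exit is assumed to be the zero-argument callback from main.py:

```python
import signal


def install_exit_handlers(loop, handle_exit,
                          signals=(signal.SIGINT, signal.SIGTERM)):
    """Register handle_exit for each signal and report which mechanism was used."""
    used = {}
    for sig in signals:
        try:
            # Preferred path: supported by asyncio event loops on Unix.
            loop.add_signal_handler(sig, handle_exit)
            used[sig] = "loop"
        except NotImplementedError:
            # Fallback path, e.g. on Windows. The handler only schedules
            # handle_exit on the loop instead of running it inside the
            # signal handler itself.
            signal.signal(
                sig,
                lambda signum, frame: loop.call_soon_threadsafe(handle_exit),
            )
            used[sig] = "signal"
    return used
```

Calling install_exit_handlers(loop, handle_exit) in place of the removed loop would cover both platforms. Like signal.signal itself, it has to be called from the main thread.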

Issue #178 in exo-explore/exo