exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
15.6k stars 836 forks

Cloud Platform Networking Support - Peer Discovery #187

Open xinchi-he opened 2 months ago

xinchi-he commented 2 months ago

Peer discovery currently relies on UDP broadcast, which most cloud platforms do not support. As a result, peer nodes cannot find each other even when they are in the same virtual network with no firewall rules enforced.
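For context, here is a minimal sketch of what broadcast-based discovery looks like in general. It is not exo's actual code: the function names and the choice of the limited broadcast address `255.255.255.255` are assumptions for illustration, and the message carries only some of the fields seen in the log output in this report.

```python
import json
import socket


def build_discovery_message(node_id: str, grpc_port: int) -> bytes:
    """Encode a presence announcement (hypothetical helper, not exo's API)."""
    payload = {"type": "discovery", "node_id": node_id, "grpc_port": grpc_port}
    return json.dumps(payload).encode("utf-8")


def broadcast_presence(message: bytes, port: int) -> None:
    """Send one UDP datagram to the limited broadcast address.

    On a home LAN every host on the segment can receive this datagram.
    On a network that does not forward broadcast traffic, the send call
    can still succeed locally while no peer ever receives the datagram.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # SO_BROADCAST must be enabled before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(message, ("255.255.255.255", port))
```

Since the sender gets no error when the network drops the datagram, the symptom can be a node that keeps announcing itself while its peer list stays empty.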

I set up two VM instances on GCP with a Debian image. This is what appears in the log when running DEBUG_DISCOVERY=9 DEBUG=9 python3 main.py:

Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Trying to find available port port=60439
[55750, 53702, 58236, 60403, 56807, 55655, 56261, 63656, 64548, 52168, 53646, 49910, 61021, 63850, 65285, 60222, 56650, 56276, 57157]
Using available port: 60439
Retrieved existing node ID: fff5b2ef-1d4d-4170-93ab-8f748d777492
Chat interface started:
 - http://10.128.0.18:9999
 - http://127.0.0.1:9999
ChatGPT API endpoint served at:
 - http://10.128.0.18:9999/v1/chat/completions
 - http://127.0.0.1:9999/v1/chat/completions
tinygrad Device.DEFAULT='CLANG'
Server started, listening on 0.0.0.0:60439
tinygrad Device.DEFAULT='CLANG'
Starting peer discovery process...
Current number of known peers: 0. Waiting 5 seconds to discover more...
No new peers discovered in the last grace period. Ending discovery process.
Collecting topology max_depth=4 visited=set()
Collected topology: Topology(Nodes: {fff5b2ef-1d4d-4170-93ab-8f748d777492: Model: Linux Box (Device: CLANG). 
Chip: Unknown Chip (Device: CLANG). Memory: 15990MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 
TFLOPS}, Edges: {})
Peer statuses: {}
Broadcast presence: b'{"type": "discovery", "node_id": "fff5b2ef-1d4d-4170-93ab-8f748d777492", "grpc_port": 
60439, "device_capabilities": {"model": "Linux Box (Device: CLANG)", "chip": "Unknown Chip (Device: CLANG)", 
"memory": 15990, "flops": {"fp32": 0, "fp16": 0, "int8": 0}}}'
Peer statuses: {}
Broadcast presence: b'{"type": "discovery", "node_id": "fff5b2ef-1d4d-4170-93ab-8f748d777492", "grpc_port": 
60439, "device_capabilities": {"model": "Linux Box (Device: CLANG)", "chip": "Unknown Chip (Device: CLANG)",
FFAMax commented 1 month ago

Do you want to see all the other peers :D or would a way to configure peers manually be enough?

xinchi-he commented 1 month ago

@FFAMax thanks for picking up this issue. Ideally, I'd like to see all the other peers, but starting with manual configuration is good enough for now.
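As a rough illustration of what manual configuration could look like, the sketch below parses a comma-separated list of `host:port` entries into peer addresses. Everything here is hypothetical: the `Peer` type, the `parse_peers` function, and the input format are assumptions, not an existing exo option, and the second address in the usage example is made up.

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    """Address of a manually configured peer (hypothetical type)."""
    host: str
    port: int


def parse_peers(spec: str) -> List[Peer]:
    """Parse 'host:port,host:port' into a list of Peer entries."""
    peers = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty input and trailing commas
        host, sep, port = entry.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {entry!r}")
        peers.append(Peer(host, int(port)))
    return peers


# Example: two nodes in the same virtual network, listed explicitly.
peers = parse_peers("10.128.0.18:60439, 10.128.0.19:60439")
```

A static list like this avoids broadcast entirely, at the cost of having to update it whenever a node's address or port changes.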