exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
6.56k stars 342 forks

Cloud Platform Networking Support - Peer Discovery #187

Open xinchi-he opened 2 weeks ago

xinchi-he commented 2 weeks ago

Peer discovery currently relies on UDP broadcast, which cloud platforms generally do not support. As a result, peer nodes cannot find each other even when they sit in the same virtual network with no firewall rules enforced.
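To make the difference concrete, here is a minimal sketch (not exo's actual code) that contrasts a UDP broadcast announcement with a unicast announcement to an explicit peer list. The function names, the peer list, and the discovery port are hypothetical; only the message fields `type`, `node_id` and `grpc_port` are taken from the log in this issue.

```python
import json
import socket
from typing import Iterable, Tuple


def build_discovery_message(node_id: str, grpc_port: int) -> bytes:
    """Encode a presence message with the same basic fields as in the log."""
    payload = {"type": "discovery", "node_id": node_id, "grpc_port": grpc_port}
    return json.dumps(payload).encode("utf-8")


def announce_broadcast(message: bytes, port: int) -> None:
    """Send one datagram to the local broadcast address.

    On a LAN every host listening on `port` can receive this. On cloud
    virtual networks the datagram is typically not delivered to other VMs.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(message, ("<broadcast>", port))


def announce_unicast(message: bytes, peers: Iterable[Tuple[str, int]]) -> int:
    """Send the same datagram directly to each known (host, port) peer.

    Assumes the peer addresses are known up front (hypothetical static list).
    Returns the number of datagrams sent.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host, port in peers:
            sock.sendto(message, (host, port))
            sent += 1
    return sent
```

Plain unicast UDP between VMs in the same virtual network is normally routed, so something like `announce_unicast(msg, [("10.128.0.19", 5678)])` (address and port made up for illustration) would be one possible direction, at the cost of having to configure peer addresses by hand.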

To reproduce, I set up two VM instances on GCP with a Debian image. This is what appears in the log when running DEBUG_DISCOVERY=9 DEBUG=9 python3 main.py:

Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Trying to find available port port=60439
[55750, 53702, 58236, 60403, 56807, 55655, 56261, 63656, 64548, 52168, 53646, 49910, 61021, 63850, 65285, 60222, 56650, 56276, 57157]
Using available port: 60439
Retrieved existing node ID: fff5b2ef-1d4d-4170-93ab-8f748d777492
Chat interface started:
 - http://10.128.0.18:9999
 - http://127.0.0.1:9999
ChatGPT API endpoint served at:
 - http://10.128.0.18:9999/v1/chat/completions
 - http://127.0.0.1:9999/v1/chat/completions
tinygrad Device.DEFAULT='CLANG'
Server started, listening on 0.0.0.0:60439
tinygrad Device.DEFAULT='CLANG'
Starting peer discovery process...
Current number of known peers: 0. Waiting 5 seconds to discover more...
No new peers discovered in the last grace period. Ending discovery process.
Collecting topology max_depth=4 visited=set()
Collected topology: Topology(Nodes: {fff5b2ef-1d4d-4170-93ab-8f748d777492: Model: Linux Box (Device: CLANG). 
Chip: Unknown Chip (Device: CLANG). Memory: 15990MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 
TFLOPS}, Edges: {})
Peer statuses: {}
Broadcast presence: b'{"type": "discovery", "node_id": "fff5b2ef-1d4d-4170-93ab-8f748d777492", "grpc_port": 
60439, "device_capabilities": {"model": "Linux Box (Device: CLANG)", "chip": "Unknown Chip (Device: CLANG)", 
"memory": 15990, "flops": {"fp32": 0, "fp16": 0, "int8": 0}}}'
Peer statuses: {}
Broadcast presence: b'{"type": "discovery", "node_id": "fff5b2ef-1d4d-4170-93ab-8f748d777492", "grpc_port": 
60439, "device_capabilities": {"model": "Linux Box (Device: CLANG)", "chip": "Unknown Chip (Device: CLANG)",