exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
6.36k stars 329 forks

"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:127.0.0.1:7897: Socket closed" #143

Open lesong36 opened 1 month ago

lesong36 commented 1 month ago

(.venv) (base) coty@P16:~/OneDrive/LLM/repo/exo$ ^C
(.venv) (base) coty@P16:~/OneDrive/LLM/repo/exo$ ^C
(.venv) (base) coty@P16:~/OneDrive/LLM/repo/exo$ DEBUG=9 python3 main.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Trying to find available port port=50355 [60304, 55379, 57624, 60258, 57340, 58850, 53290, 55123, 57105, 59823, 50717]
Using available port: 50355
Retrieved existing node ID: d639030c-62f3-47c5-bc1f-0ee22be53e67
Chat interface started:

AlexCheema commented 1 month ago

TL;DR: We need more robust connection management. One annoying issue since we introduced sticky node IDs is that if a node restarts and comes back on a different ephemeral port, other nodes may still try to reach it on the port previously associated with that node ID.

The good thing is that this is all pretty easy to fix; it just requires a small refactor of the networking code.
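
To make the stale-port problem concrete, here is a minimal sketch of one possible approach, not exo's actual networking code. It assumes peers periodically announce their node ID together with their current host and port. `PeerTable`, `announce`, and `lookup` are hypothetical names invented for this illustration. The idea is that the node ID stays sticky while the address is always overwritten by the latest announcement, and entries that have not been refreshed recently are treated as unknown rather than dialed on an old port.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]  # (host, port)


@dataclass
class PeerEntry:
    address: Address   # most recently announced address for this node ID
    last_seen: float   # time of the most recent announcement, in seconds


class PeerTable:
    """Hypothetical registry mapping sticky node IDs to their latest address."""

    def __init__(self, timeout: float = 30.0) -> None:
        self._peers: Dict[str, PeerEntry] = {}
        self._timeout = timeout  # assumed staleness threshold, in seconds

    def announce(self, node_id: str, host: str, port: int, now: float) -> bool:
        """Record an announcement.

        Returns True when a known node ID shows up with a different address,
        which signals that any existing connection should be torn down and
        re-established against the new address.
        """
        new_address = (host, port)
        entry = self._peers.get(node_id)
        changed = entry is not None and entry.address != new_address
        self._peers[node_id] = PeerEntry(new_address, now)
        return changed

    def lookup(self, node_id: str, now: float) -> Optional[Address]:
        """Return the current address, or None if the peer is unknown or stale."""
        entry = self._peers.get(node_id)
        if entry is None or now - entry.last_seen > self._timeout:
            return None
        return entry.address


# Example: the same node ID restarts and comes back on a new ephemeral port.
table = PeerTable(timeout=30.0)
table.announce("node-a", "192.168.1.10", 50355, now=0.0)
needs_reconnect = table.announce("node-a", "192.168.1.10", 51234, now=5.0)
current_address = table.lookup("node-a", now=6.0)
```

In this sketch the caller decides what to do when `announce` reports a change, for example closing the old channel and opening a new one. Keying the table on node ID alone, rather than on the ID and port together, is what lets a restarted node replace its own stale entry.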

lesong36 commented 1 month ago

Thanks for your answer. Anything I can do for this issue?