exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
9.72k stars 517 forks

Host not healthy. #274

Open lipere123 opened 4 days ago

lipere123 commented 4 days ago

Hello Alex.

Nice changes!!!! 🥇

Here is a bug that I have:

Peer 191cdd5d-a67f-4cf3-b257-b39540056034 at 192.168.193.72:50051 is not healthy. Skipping.
Peer 187b3c4f-9521-40ec-9c0c-9ef0740dbf26 at 192.168.193.62:50051 is not healthy. Skipping.
Peer 41db25f4-a09a-4062-9680-805be7b81758 at 192.168.193.52:50051 is not healthy. Skipping.
Peer 31a53c7b-ce99-42f0-b6dc-0627ac245a6e at 192.168.193.22:50051 is not healthy. Skipping.
Peer 9ce039ff-dcf2-4cef-bccc-c0a4953be3cd at 192.168.193.42:50051 is not healthy. Skipping.

This only happens when you run:

CUDA=1 DEBUG=9 /usr/local/exo/bin/exo --inference-engine tinygrad --node-host $myip --node-port 50051 --max-parallel-downloads 1 --disable-tui --wait-for-peers 1

++ Best Regards. Benjamin.

AlexCheema commented 4 days ago

I think I know what's going on here. The assumption we make on this line is not true: https://github.com/exo-explore/exo/blob/2654f290c3179aa143960e336e8985a8b6f6b72b/exo/networking/udp/udp_discovery.py#L143 i.e. self.known_peers[peer_id][0].addr() is not always of the form {peer_host}:{peer_port}. This would fail on some network setups. It works fine on mine and all I've seen so far but I'm pretty sure this isn't always true.

I will push a fix tomorrow.
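To illustrate why comparing addresses as raw strings is fragile, here is a minimal sketch that parses each address into a (host, port) pair before comparing. The helper names parse_addr and same_endpoint are hypothetical and not part of exo, and this is not necessarily how the actual fix works.

```python
import ipaddress
from typing import Optional, Tuple


def parse_addr(addr: str) -> Optional[Tuple[str, int]]:
    """Split 'host:port' or '[v6]:port' into (host, port); return None if it does not fit."""
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        return None
    host = host.strip("[]")
    try:
        # Canonical text form for IP literals, so equivalent spellings compare equal.
        host = str(ipaddress.ip_address(host))
    except ValueError:
        # Not an IP literal: treat it as a hostname and compare case-insensitively.
        host = host.lower()
    return host, int(port)


def same_endpoint(a: str, b: str) -> bool:
    """True only when both strings parse and name the same host and port."""
    pa, pb = parse_addr(a), parse_addr(b)
    return pa is not None and pa == pb


print(same_endpoint("[::1]:50051", "[0:0:0:0:0:0:0:1]:50051"))
print(same_endpoint("192.168.16.52:50051", "192.168.193.52:50051"))
```

An address that does not parse is never treated as equal to anything, so a handle with an unexpected addr() format would be recreated rather than silently matched.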

lipere123 commented 4 days ago

Ok, thanks for the update. Take your time. 😁


lipere123 commented 4 days ago
  if (peer_id not in self.known_peers) or (self.known_peers[peer_id][0].addr() != f"{peer_host}:{peer_port}"):
    new_peer_handle = self.create_peer_handle(peer_id, f"{peer_host}:{peer_port}", device_capabilities)
    # if not await new_peer_handle.health_check():
    #   if DEBUG >= 1: print(f"Peer {peer_id} at {peer_host}:{peer_port} is not healthy. Skipping.")
    #   return
    if DEBUG >= 1: print(f"Adding {peer_id=} at {peer_host}:{peer_port}. Replace existing peer_id: {peer_id in self.known_peers}")
    self.known_peers[peer_id] = (new_peer_handle, time.time(), time.time())
  else:
    if not await self.known_peers[peer_id][0].health_check():
      if DEBUG >= 1: print(f"Peer {peer_id} at {peer_host}:{peer_port} is not healthy. Removing.")
      if peer_id in self.known_peers: del self.known_peers[peer_id]
      return
    self.known_peers[peer_id] = (self.known_peers[peer_id][0], self.known_peers[peer_id][1], time.time())

This gives:



Detected system: Linux
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
Retrieved existing node ID: 9ce039ff-dcf2-4cef-bccc-c0a4953be3cd
Chat interface started:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/exo-src/exo/orchestration/standard_node.py", line 312, in connect_with_timeout
    await asyncio.wait_for(peer.connect(), timeout)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 519, in wait_for
    async with timeouts.timeout(timeout):
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
Failed to connect peers: ['41db25f4-a09a-4062-9680-805be7b81758@192.168.193.52:50051']
Collecting topology max_depth=4 visited=set()
Error collecting topology from 41db25f4-a09a-4062-9680-805be7b81758: <AioRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.193.52:50051: Failed to connect to remote host: Connection refused"
  debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:192.168.193.52:50051: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-10-03T00:50:23.813459773+00:00"}"

Collected topology: Topology(Nodes: {9ce039ff-dcf2-4cef-bccc-c0a4953be3cd: Model: Linux Box (NVIDIA RTX 4000 ADA GENERATION). Chip: NVIDIA RTX 4000 ADA GENERATION. Memory: 20475MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS, 41db25f4-a09a-4062-9680-805be7b81758: Model: Linux Box (NVIDIA RTX 4000 ADA GENERATION). Chip: NVIDIA RTX 4000 ADA GENERATION. Memory: 20475MB. Flops: fp32: 0.00 TFLOPS, fp16: 0.00 TFLOPS, int8: 0.00 TFLOPS}, Edges: {9ce039ff-dcf2-4cef-bccc-c0a4953be3cd: {'41db25f4-a09a-4062-9680-805be7b81758'}, 41db25f4-a09a-4062-9680-805be7b81758: {'9ce039ff-dcf2-4cef-bccc-c0a4953be3cd'}})
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: False
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
update_peers: added=[] removed=[] updated=[] unchanged=[<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x761ed3950830>] to_disconnect=[] to_connect=[<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x761ed3950830>]
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.16.52:50051. Replace existing peer_id: True
Adding peer_id='41db25f4-a09a-4062-9680-805be7b81758' at 192.168.193.52:50051. Replace existing peer_id: True
Error connecting peer 41db25f4-a09a-4062-9680-805be7b81758@192.168.193.52:50051:

So okay, I am cheating a little for testing. But I have two remarks. First, it does not connect, so we need a better print on the test to understand why. Second, I think it is because I have two addresses:

Maybe it tries to connect to a 192.168.193.x address that I don't want it to use?

Thanks in advance. Best Regards. Benjamin.
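The alternating "Adding ... Replace existing peer_id: True" lines in the log are what the string comparison produces when one peer announces itself from two interfaces. The sketch below replays that situation with a simplified stand-in for the discovery logic (it stores plain address strings instead of peer handles). The simulate function, the allowed parameter, and the /24 subnet mask are assumptions made for illustration; exo may not offer such a filter.

```python
import ipaddress
from typing import Dict, Iterable, List, Optional, Tuple


def simulate(broadcasts: Iterable[Tuple[str, str, int]], allowed: Optional[str] = None) -> List[str]:
    """Replay (peer_id, host, port) broadcasts and return the 'Adding' log lines produced."""
    net = ipaddress.ip_network(allowed) if allowed else None
    known: Dict[str, str] = {}  # peer_id -> "host:port" (stand-in for a peer handle)
    log: List[str] = []
    for peer_id, host, port in broadcasts:
        # Hypothetical filter: ignore announcements from outside the chosen subnet.
        if net is not None and ipaddress.ip_address(host) not in net:
            continue
        addr = f"{host}:{port}"
        if known.get(peer_id) != addr:
            log.append(f"Adding {peer_id} at {addr}. Replace existing peer_id: {peer_id in known}")
            known[peer_id] = addr
    return log


# One peer announcing alternately from two interfaces, as in the log above.
broadcasts = [("41db25f4", "192.168.16.52", 50051), ("41db25f4", "192.168.193.52", 50051)] * 3

unfiltered = simulate(broadcasts)
filtered = simulate(broadcasts, allowed="192.168.16.0/24")
print(len(unfiltered), len(filtered))  # prints: 6 1
```

Without the filter every broadcast replaces the previous entry, so the peer never settles on one address. With the filter, only the first announcement from the wanted subnet adds the peer.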

AlexCheema commented 3 days ago

Can you try again? I hope #278 fixes this for you.

lipere123 commented 3 days ago

I am guinguette back home and I Will try.


lipere123 commented 3 days ago

I am going back home and I will try.


lipere123 commented 3 days ago

That worked for the bug. But I still have the problem that I showed you last time.

edgenode7-exo-run.log
edgenode6-exo-run.log
edgenode5-exo-run.log
edgenode4-exo-run.log
edgenode3-exo-run.log
edgenode2-exo-run.log
tinychat - Google Chrome_001

++ Thanks in advance. Best regards. Benjamin.

FFAMax commented 1 day ago

Hello Ben, please check which addresses port 50051 is listening on, on machine 192.168.16.22. Example Linux command: netstat -nlp | grep 50051
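As a complement to netstat on the listening machine, the port can also be probed from another node. This is a minimal sketch using only the Python standard library; the probe function is a hypothetical helper, not part of exo, and the demo runs against a throwaway local listener so that it works anywhere.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connect and return a short, human-readable outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused (nothing is listening on this address and port)"
    except socket.timeout:
        return "timeout (host down, traffic filtered, or wrong interface)"
    except OSError as exc:
        return f"error: {exc}"


# Demo against a temporary local listener on an OS-assigned port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

open_result = probe("127.0.0.1", port)    # listener is up
server.close()
closed_result = probe("127.0.0.1", port)  # listener is gone

print(open_result, "|", closed_result)
```

On a cluster node one would call probe once per address of a peer, for example both the 192.168.16.x and the 192.168.193.x address with port 50051, to see which interface actually accepts connections.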

lipere123 commented 1 day ago

???? Okay, doing that this afternoon.
