exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
6.56k stars 342 forks source link

Unstable node connection and model loading error when using Termux with Proot on multiple Android devices #156

Closed artistlu closed 3 weeks ago

artistlu commented 3 weeks ago

the error

Compilation error: 1
Command: ['clang', '-include', 'tgmath.h', '-shared', '-O2', '-Wall', '-Werror', '-x', 'c', '-fPIC', '-', 
'-o', '/tmp/tmpgqkxfa2o']
Stdout: 
Stderr: <stdin>:2:18: error: __bf16 is not supported on this target
void E_4194304_4(__bf16* restrict data0) {
                 ^
<stdin>:5:22: error: __bf16 is not supported on this target
    data0[alu0+1] = (__bf16)(0.0);
                     ^
<stdin>:5:29: error: cannot type-cast to __bf16
    data0[alu0+1] = (__bf16)(0.0);
                            ^~~~~
<stdin>:6:22: error: __bf16 is not supported on this target
    data0[alu0+2] = (__bf16)(0.0);
                     ^
<stdin>:6:29: error: cannot type-cast to __bf16
    data0[alu0+2] = (__bf16)(0.0);
                            ^~~~~
<stdin>:7:22: error: __bf16 is not supported on this target
    data0[alu0+3] = (__bf16)(0.0);
                     ^
<stdin>:7:29: error: cannot type-cast to __bf16
    data0[alu0+3] = (__bf16)(0.0);
                            ^~~~~
<stdin>:8:20: error: __bf16 is not supported on this target
    data0[alu0] = (__bf16)(0.0);
                   ^
<stdin>:8:27: error: cannot type-cast to __bf16
    data0[alu0] = (__bf16)(0.0);
                          ^~~~~
9 errors generated.
Broadcasting opaque status: request_id='fb2ff680-fc74-4d4c-9e01-d9f75d5031d6' status='{"type": "node_status", "node_id": "06df765a-a7ad-4dd4-9ed6-bae8dae6fcf2", "status": "start_process_prompt", 
"base_shard": {"model_id": "/nasroot/models/Meta-Llama-3-8B/", "start_layer": 0, "end_layer": 11, 
"n_layers": 32}, "shard": {"model_id": "/nasroot/models/Meta-Llama-3-8B/", "start_layer": 0, "end_layer": 
11, "n_layers": 32}, "prompt": "<|im_start|>user\\nWhat is the meaning of 
exo?<|im_end|>\\n<|im_start|>assistant\\n", "image_str": "", "inference_state": null, "request_id": 
"fb2ff680-fc74-4d4c-9e01-d9f75d5031d6"}'
Collecting topology max_depth=4 visited={'6a51cf79-6b26-43d3-8f7b-29872e0e177f', 
'e1f6b76c-6b40-4152-a237-becd7c35e5ae'}
Error sending opaque status to 6a51cf79-6b26-43d3-8f7b-29872e0e177f: <AioRpcError of RPC that terminated 
with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:172.20.10.6:7890: 
Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer  
{created_time:"2024-08-19T08:02:11.441285252+00:00", grpc_status:14, grpc_message:"failed to connect to all 
addresses; last error: UNAVAILABLE: ipv4:172.20.10.6:7890: Socket closed"}"
>
Traceback (most recent call last):
  File "/root/exo/exo/orchestration/standard_node.py", line 376, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
  File "/root/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/root/exo/exo/networking/grpc/grpc_peer_handle.py", line 109, in send_opaque_status
    await self.stub.SendOpaqueStatus(request)
  File "/root/miniconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:172.20.10.6:7890: 
Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer  
{created_time:"2024-08-19T08:02:11.441285252+00:00", grpc_status:14, grpc_message:"failed to connect to all 
addresses; last error: UNAVAILABLE: ipv4:172.20.10.6:7890: Socket closed"}"
>
Connecting to 6a51cf79-6b26-43d3-8f7b-29872e0e177f...
AlexCheema commented 3 weeks ago

Can you try running with SUPPORT_BF16=0

if that works I’m going to do some work to automatically detect this as it’s come up before

artistlu commented 3 weeks ago

Can you try running with SUPPORT_BF16=0

if that works I’m going to do some work to automatically detect this as it’s come up before

SUPPORT_BF16=0 works