Open wjwc opened 2 months ago
Can you try on the latest commit with DEBUG=2 and paste the entire output here please?
Hi, I also encountered the same problem when deploying llama3.1-70B on two MacBook Airs. Below is the log output from running DEBUG=2 python main.py; the lines marked with ########## are information I printed myself.
```
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Received request: GET /
Received request: GET /index.js
Received request: GET /index.css
Received request: GET /common.css
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Received request: POST /v1/chat/completions
Handling chat completions request from 10.23.0.28: {'model': 'llama-3.1-70b', 'messages': [{'role': 'user', 'content': 'hello'}], 'stream': True}
Sending prompt from ChatGPT api request_id='fb481b23-7145-4bb2-96dd-69c8489ad095' shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=0, n_layers=80) prompt='<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' image_str=None
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
[fb481b23-7145-4bb2-96dd-69c8489ad095] process prompt: base_shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=0, n_layers=80) shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80) prompt='<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' image_str=None
[fb481b23-7145-4bb2-96dd-69c8489ad095] forwarding to next shard: base_shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=0, n_layers=80) shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80) prompt='<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' image_str=None
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
Current partition index: 1
Computed next from: Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80), Topology(Nodes: {master: Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS, node1: Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS}, Edges: {master: {'node1'}, node1: {'master'}}). Next partition: Partition(node_id='node1', start=0, end=0.5)
Sending tensor_or_prompt to node1: <|begin_of_text|><|start_header_id|>user<|end_header_id|>

hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
Preemptively starting download for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
Preemptively starting download for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80)
Download already in progress for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80). Keeping that one.
Commit hash is already hashed at /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/refs/main: da51b8316183c357b17fa594c0480539a4ffccc3
Using cached file list from /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/cachedreqs/da51b8316183c357b17fa594c0480539a4ffccc3/fetch_file_list.json
File already fully downloaded: model.safetensors.index.json
Commit hash is already hashed at /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/refs/main: da51b8316183c357b17fa594c0480539a4ffccc3
Using cached file list from /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/cachedreqs/da51b8316183c357b17fa594c0480539a4ffccc3/fetch_file_list.json
File already fully downloaded: model-00001-of-00008.safetensors
File already fully downloaded: config.json
File already fully downloaded: model-00002-of-00008.safetensors
File already fully downloaded: model-00003-of-00008.safetensors
File already fully downloaded: model-00004-of-00008.safetensors
File already fully downloaded: model-00005-of-00008.safetensors
File already fully downloaded: model-00006-of-00008.safetensors
File already fully downloaded: model-00007-of-00008.safetensors
File already fully downloaded: model-00008-of-00008.safetensors
File already fully downloaded: model.safetensors.index.json
File already fully downloaded: special_tokens_map.json
File already fully downloaded: tokenizer.json
File already fully downloaded: tokenizer_config.json
Removing download task for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=40, end_layer=79, n_layers=80): True
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Timeout sending opaque status to node1
Timeout sending opaque status to node1
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Already connected to node1: True
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Waiting for response to finish. timeout=500s
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80)
#################shards [Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80)]
#################shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80)
[fb481b23-7145-4bb2-96dd-69c8489ad095] process_tensor: tensor.size=90112 tensor.shape=(1, 11, 8192)
[]
Streaming completion: {'id': 'chatcmpl-fb481b23-7145-4bb2-96dd-69c8489ad095', 'object': 'chat.completion', 'created': 1723595142, 'model': 'llama-3.1-70b', 'system_fingerprint': 'exo_0.0.1', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': ''}, 'logprobs': None, 'finish_reason': None, 'delta': {'role': 'assistant', 'content': ''}}]}
Commit hash is already hashed at /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/refs/main: da51b8316183c357b17fa594c0480539a4ffccc3
Using cached file list from /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/cachedreqs/da51b8316183c357b17fa594c0480539a4ffccc3/fetch_file_list.json
File already fully downloaded: model.safetensors.index.json
Commit hash is already hashed at /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/refs/main: da51b8316183c357b17fa594c0480539a4ffccc3
Using cached file list from /Users/shaoping_liu/.cache/huggingface/hub/models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit/cachedreqs/da51b8316183c357b17fa594c0480539a4ffccc3/fetch_file_list.json
File already fully downloaded: config.json
File already fully downloaded: model-00001-of-00008.safetensors
File already fully downloaded: model-00003-of-00008.safetensors
File already fully downloaded: model-00002-of-00008.safetensors
File already fully downloaded: model-00004-of-00008.safetensors
File already fully downloaded: model-00006-of-00008.safetensors
File already fully downloaded: model-00007-of-00008.safetensors
File already fully downloaded: model-00005-of-00008.safetensors
File already fully downloaded: model-00008-of-00008.safetensors
File already fully downloaded: model.safetensors.index.json
File already fully downloaded: tokenizer.json
File already fully downloaded: special_tokens_map.json
File already fully downloaded: tokenizer_config.json
Removing download task for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80): True
############y array([[[0.030426, -0.0333557, -0.0289917, ..., 0.0476074, -0.0448303, -0.0500488],
       [-0.138916, -0.0391846, 0.0787964, ..., -0.0496521, -0.0134888, 0.154297],
       [0.128906, 0.0585632, -0.086731, ..., 0.0101547, 0.224243, 0.0139313],
       ...,
       [0.0368347, -0.142334, -0.0925903, ..., -0.01828, -0.0875244, 0.0447693],
       [-0.111389, -0.191895, -0.156494, ..., -0.0600281, -0.0953979, -0.0894775],
       [-0.0445862, -0.177002, -0.202026, ..., -0.160889, -0.0799561, -0.168823]]], dtype=float16)
Error processing tensor for shard Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=79, n_layers=80): [gather] Got indices with invalid dtype. Indices must be integral.
Traceback (most recent call last):
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/exo/orchestration/standard_node.py", line 221, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/exo/inference/mlx/sharded_inference_engine.py", line 30, in infer_tensor
    output_data: np.ndarray = np.array(self.stateful_sharded_model.step(request_id, mx.array(input_data)))
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/exo/inference/mlx/sharded_model.py", line 49, in step
    output = self.model(y[None] if self.shard.is_first_layer() else y, cache=self.request_cache[request_id])
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/exo/inference/mlx/models/llama.py", line 89, in __call__
    out = self.model(inputs, cache)
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/exo/inference/mlx/models/llama.py", line 53, in __call__
    h = self.embed_tokens(inputs)
  File "/Users/shaoping_liu/Desktop/liudong/WorkData/llm_deploy/exo/.venv/lib/python3.12/site-packages/mlx/nn/layers/quantized.py", line 95, in __call__
    self["weight"][x],
ValueError: [gather] Got indices with invalid dtype. Indices must be integral.
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
Collecting topology max_depth=4 visited={'node1'}
Connecting to node1...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='node1')
```
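For what it's worth, the final ValueError is reproducible in isolation: MLX array indexing (the gather behind `embed_tokens`) only accepts integral index arrays. A minimal sketch with illustrative names and shapes, assuming nothing beyond `mlx.core`:

```python
import mlx.core as mx

weights = mx.random.normal((128, 8))      # stand-in for an embedding table
float_ids = mx.array([[0.53, 1.2, 7.9]])  # float activations, like the array above
int_ids = float_ids.astype(mx.int32)      # what a first-layer shard should receive

print(weights[int_ids].shape)  # (1, 3, 8): integral indices gather fine
weights[float_ids]             # ValueError: [gather] Got indices with invalid dtype.
                               # Indices must be integral.
```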
The log on node1 is as follows:
```
Removing download task for Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80): True
"model-00001-of-00008.safetensors": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
"model-00002-of-00008.safetensors": [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
"model-00003-of-00008.safetensors": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
"model-00004-of-00008.safetensors": [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42],
"model-00005-of-00008.safetensors": [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
"model-00006-of-00008.safetensors": [53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64],
"model-00007-of-00008.safetensors": [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75],
"model-00008-of-00008.safetensors": [75, 76, 77, 78, 79],
[868e317c-49f1-43ee-af47-f19e558eb8aa] result size: 98304, is finished: False, buffered tokens: 0
SendPrompt shard=Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80) prompt='<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n你好<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' image_str='' request_id='868e317c-49f1-43ee-af47-f19e558eb8aa' result: None
Current partition index: 0
Computed next from: Shard(model_id='mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', start_layer=0, end_layer=39, n_layers=80), Topology(Nodes: {node1: Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS, master: Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS}, Edges: {node1: {'master'}, master: {'node1'}}). Next partition: Partition(node_id='master', start=0.5, end=1.0)
Sending tensor_or_prompt to master: [[[ 3.0426e-02 -3.3356e-02 -2.8992e-02 ...  4.7607e-02 -4.4830e-02 -5.0049e-02]
  [-1.3892e-01 -3.9185e-02  7.8796e-02 ... -4.9652e-02 -1.3489e-02  1.5430e-01]
  [ 1.2891e-01  5.8563e-02 -8.6731e-02 ...  1.0155e-02  2.2424e-01  1.3931e-02]
  ...
  [ 2.1423e-02 -9.4482e-02 -7.7942e-02 ... -1.3733e-04 -6.8848e-02  6.5674e-02]
  [ 8.5297e-03 -1.9690e-01 -7.4646e-02 ... -2.2412e-01 -6.3721e-02 -3.7109e-02]
  [-3.6163e-02 -1.5234e-01 -9.8083e-02 ... -1.4417e-01 -8.3984e-02 -4.6295e-02]]]
Broadcasting opaque status: request_id='868e317c-49f1-43ee-af47-f19e558eb8aa' status='{"type": "node_status", "node_id": "node1", "status": "end_process_prompt", "base_shard": {"model_id": "mlx-community/Meta-Llama-3.1-70B-Instruct-4bit", "start_layer": 0, "end_layer": 39, "n_layers": 80}, "shard": {"model_id": "mlx-community/Meta-Llama-3.1-70B-Instruct-4bit", "start_layer": 0, "end_layer": 39, "n_layers": 80}, "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n\u4f60\u597d<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", "image_str": "", "inference_state": null, "request_id": "868e317c-49f1-43ee-af47-f19e558eb8aa", "elapsed_time_ns": 74094475583, "result_size": 0}'
Connecting to master...
Connected to peer Model: MacBook Air. Chip: Apple M2. Memory: 16384MB. Flops: fp32: 3.55 TFLOPS, fp16: 7.10 TFLOPS, int8: 14.20 TFLOPS (peer.id()='master')
Collecting topology max_depth=4 visited={'master'}
Error sending opaque status to master: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-08-14T10:42:08.5779+08:00"}"
>
Traceback (most recent call last):
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/orchestration/standard_node.py", line 376, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=500.0)
  File "/Users/liushaoping/miniconda3/envs/llm/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/networking/grpc/grpc_peer_handle.py", line 109, in send_opaque_status
    await self.stub.SendOpaqueStatus(request)
  File "/Users/liushaoping/Desktop/WorkData/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-08-14T10:42:08.5779+08:00"}"
>
Error sending opaque status to master: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:42:08.577905+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>
Traceback (most recent call last):
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/orchestration/standard_node.py", line 376, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=500.0)
  File "/Users/liushaoping/miniconda3/envs/llm/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/networking/grpc/grpc_peer_handle.py", line 109, in send_opaque_status
    await self.stub.SendOpaqueStatus(request)
  File "/Users/liushaoping/Desktop/WorkData/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:42:08.577905+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>
Error sending opaque status to master: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:42:08.577908+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>
Traceback (most recent call last):
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/orchestration/standard_node.py", line 376, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=500.0)
  File "/Users/liushaoping/miniconda3/envs/llm/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/networking/grpc/grpc_peer_handle.py", line 109, in send_opaque_status
    await self.stub.SendOpaqueStatus(request)
  File "/Users/liushaoping/Desktop/WorkData/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:42:08.577908+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>
Task exception was never retrieved
future: <Task finished name='Task-124' coro=<StandardNode.forward_to_next_shard() done, defined at /Users/liushaoping/Desktop/WorkData/exo/exo/orchestration/standard_node.py:241> exception=<AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:43:22.605247+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>>
Traceback (most recent call last):
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/orchestration/standard_node.py", line 278, in forward_to_next_shard
    await target_peer.send_tensor(next_shard, tensor_or_prompt, request_id=request_id, inference_state=inference_state)
  File "/Users/liushaoping/Desktop/WorkData/exo/exo/networking/grpc/grpc_peer_handle.py", line 74, in send_tensor
    response = await self.stub.SendTensor(request)
  File "/Users/liushaoping/Desktop/WorkData/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-08-14T10:43:22.605247+08:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:10.23.0.28:49807: Failed to connect to remote host: Connection refused"}"
>
Download progress from node1: {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'completed_files': 1, 'total_files': 13, 'downloaded_bytes': 1017, 'downloaded_bytes_this_session': 0, 'total_bytes': 39697862564, 'overall_speed': 0, 'overall_eta': 0.0, 'file_progress': {'config.json': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'config.json', 'downloaded': 1017, 'downloaded_this_session': 0, 'total': 1017, 'speed': 0, 'eta': 0.0, 'status': 'complete'}, 'model-00001-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00001-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5272167770, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00002-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00002-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649694, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00003-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00003-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649717, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00004-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00004-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649733, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00005-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00005-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649717, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00006-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00006-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649733, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00007-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00007-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 5294649739, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model-00008-of-00008.safetensors': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model-00008-of-00008.safetensors', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 2648501502, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'model.safetensors.index.json': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'model.safetensors.index.json', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 158327, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'special_tokens_map.json': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'special_tokens_map.json', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 296, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'tokenizer.json': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'tokenizer.json', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 9084449, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}, 'tokenizer_config.json': {'repo_id': 'mlx-community/Meta-Llama-3.1-70B-Instruct-4bit', 'repo_revision': 'main', 'file_path': 'tokenizer_config.json', 'downloaded': 0, 'downloaded_this_session': 0, 'total': 50870, 'speed': 0, 'eta': 0.0, 'status': 'not_started'}}, 'st
```
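One detail worth noting from the master log above: right after the two "Timeout sending opaque status to node1" entries, the shard list collapses from [0-39] plus [40-79] to a single Shard(start_layer=0, end_layer=79), so the master runs the float16 activations it received from node1 through the first-layer `embed_tokens` path. A hypothetical guard (illustrative only, not code from exo) that would surface this earlier with a clearer message:

```python
import numpy as np

def check_first_layer_input(shard, tensor: np.ndarray) -> None:
    # embed_tokens gathers embedding rows by token id, so a shard that
    # believes it owns layer 0 must receive integral token ids, never
    # float activations forwarded from another shard.
    if shard.is_first_layer() and not np.issubdtype(tensor.dtype, np.integer):
        raise ValueError(
            f"first-layer shard {shard} received {tensor.dtype} activations of "
            f"shape {tensor.shape}; the topology may have been re-partitioned "
            "mid-request"
        )
```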
Also having the same issue +1
Running llama-3.1-70b on multiple Macs with exo + mlx, I found that it errors out at quantization time. The error location is the quantized.py file. The code:

```python
def __call__(self, x):
    s = x.shape
    x = x.flatten()
    out = mx.dequantize(
        self["weight"][x],
        scales=self["scales"][x],
        biases=self["biases"][x],
        group_size=self.group_size,
        bits=self.bits,
    )
    return out.reshape(*s, -1)
```

The printed value of x is:

```
x is array([[[[0.534668, -0.012146, 0.0141449, ..., 0.0268707, 0.143066, 0.124268],
    [-0.395752, 0.266846, 0.144653, ..., -0.00493622, 0.220337, 0.483398],
    [-0.0620728, 0.0823975, -0.101807, ..., 0.183838, 0.0869141, 0.0322876],
    ...,
    [-0.15918, 0.0597534, -0.110474, ..., 0.102905, 0.00811768, 0.0138779],
    [-0.203369, 0.0515747, -0.0604248, ..., 0.0429993, 0.038208, 0.074707],
    [-0.129517, 0.0895386, -0.138306, ..., -0.0203552, -0.0138397, 0.00897217]]]], dtype=float16)
```

The error:

```
Error processing tensor for shard Shard(model_id='mlx-community/Meta-Llama-3-70B-Instruct-4bit', start_layer=0, end_layer=52, n_layers=80): [gather] Got indices with invalid dtype. Indices must be integral.
Traceback (most recent call last):
  File "/Users/liushaoping/exo/exo/orchestration/standard_node.py", line 217, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
  File "/Users/liushaoping/exo/exo/inference/mlx/sharded_inference_engine.py", line 29, in infer_tensor
    output_data: np.ndarray = np.array(self.stateful_sharded_model.step(request_id, mx.array(input_data)))
  File "/Users/liushaoping/exo/exo/inference/mlx/sharded_model.py", line 48, in step
    output = self.model(y[None] if self.shard.is_first_layer() else y, cache=self.request_cache[request_id])
  File "/Users/liushaoping/exo/exo/inference/mlx/models/llama.py", line 89, in __call__
    out = self.model(inputs, cache)
  File "/Users/liushaoping/exo/exo/inference/mlx/models/llama.py", line 53, in __call__
    h = self.embed_tokens(inputs)
  File "/Users/liushaoping/exo/.venv/lib/python3.12/site-packages/mlx/nn/layers/quantized.py", line 103, in __call__
    self["weight"][x],
ValueError: [gather] Got indices with invalid dtype. Indices must be integral.
```
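The `embed_tokens` in that trace is a quantized embedding, so `self["weight"][x]` is a gather of table rows by token id, and the float16 `x` printed above cannot serve as indices. A miniature of the same dequantize path with toy shapes (a sketch only, assuming the `mx.quantize`/`mx.dequantize` APIs already visible in the snippet above):

```python
import mlx.core as mx

# Quantize a toy embedding table, then gather + dequantize rows by token id,
# mirroring what the quantized __call__ above does.
table = mx.random.normal((16, 64))
w_q, scales, biases = mx.quantize(table, group_size=64, bits=4)

ids = mx.array([3, 7])  # integral token ids: this works
rows = mx.dequantize(w_q[ids], scales=scales[ids], biases=biases[ids],
                     group_size=64, bits=4)
print(rows.shape)  # (2, 64)

# Indexing with float activations instead of token ids reproduces the error:
# w_q[mx.array([[0.53, -0.01]])]
# ValueError: [gather] Got indices with invalid dtype. Indices must be integral.
```

The float16 `x` suggests this first-layer shard received another shard's hidden state rather than the prompt's token ids, consistent with the shard-collapse behavior in the logs above.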