lipere123 opened 1 month ago
I'm also intermittently experiencing this, see: https://github.com/exo-explore/exo/issues/235.
It seems exo is unable to properly split the model into chunks so that it can be proportionally loaded across several nodes (Llama 3.1 8B fails to split across 3x 8GB GPUs).
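For what it's worth, here is how I understand the intended behavior: each node gets a contiguous range of layers proportional to its share of the cluster's total memory. A minimal sketch of that idea (my own illustration, not exo's actual partitioning code):

```python
# Minimal sketch of memory-proportional layer partitioning
# (my own illustration of the idea, not exo's implementation).

def partition_layers(num_layers: int, node_memories_gb: list[float]) -> list[range]:
    """Assign each node a contiguous slice of layers proportional
    to its share of the cluster's total memory."""
    total = sum(node_memories_gb)
    ranges, start = [], 0
    for i, mem in enumerate(node_memories_gb):
        # Last node takes the remainder to avoid rounding gaps.
        if i == len(node_memories_gb) - 1:
            end = num_layers
        else:
            end = start + round(num_layers * mem / total)
        ranges.append(range(start, end))
        start = end
    return ranges

# Llama 3.1 8B has 32 transformer layers; three equal 8 GB nodes
# should each get roughly a third of them.
print(partition_layers(32, [8.0, 8.0, 8.0]))
# [range(0, 11), range(11, 22), range(22, 32)]
```

If something like this is failing, a node could end up assigned more layers than its VRAM can hold, which would match the OOM below.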
Running ./exo-cli-3.1-70b.sh hello, where the script is:

#!/bin/bash
/usr/bin/curl --progress-bar --connect-timeout 1800 --max-time 1800 http://edgenode2:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "llama-3.1-70b", "messages": [{"role": "user", "content": "hello"}], "temperature": 0.7 }'

The request fails with:
{"detail": "Error processing prompt (see logs with DEBUG>=2): <AioRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = \"Unexpected <class 'RuntimeError'>: CUDA Error 2, out of memory\"\n\tdebug_error_string = \"UNKNOWN:Error received from peer {created_time:\"2024-10-10T05:40:20.356329187+00:00\", grpc_status:2, grpc_message:\"Unexpected <class \'RuntimeError\'>: CUDA Error 2, out of memory\"}\"\n>"}