puppetm4st3r opened 4 hours ago
Workaround: install an up-to-date ray inside the running container:
docker exec -it cont_name pip install -U "ray[data,train,tune,serve]"
or override the entrypoint so the upgrade runs before the server starts:
docker run --name "llm_server_$SELECTED_PORT" --user root --runtime nvidia $GPU_OPTION --shm-size 10g --ipc=host \
-e HF_HUB_ENABLE_HF_TRANSFER="true" \
-e RAY_DEDUP_LOGS=1 \
-e HF_HOME="/data" \
-p "$SELECTED_PORT":2242 -v /media/dario/work/dm/dolf/server/data_aphro/cache:/app/aphrodite-engine/.cache -v "$(pwd)/data_aphro:/home/workspace" -v "$(pwd)/datafolder:/data" \
--entrypoint /bin/sh \
alpindale/aphrodite-openai:v0.6.1.post1 \
-c "pip install -U 'ray[data,train,tune,serve]' && python3 -m aphrodite.endpoints.openai.api_server \
--download-dir /data --kv-cache-dtype fp8_e5m2 --tensor-parallel-size $NUM_SHARD --model $MODEL_ID --dtype auto \
--gpu-memory-utilization $GPU_MEMORY --max-model-len $MAX_TOTAL_TOKENS \
--swap-space 2 --max-log-len 100000 --disable-log-requests --disable-custom-all-reduce --guided-decoding-backend lm-format-enforcer --api-keys 123 \
--chat-template $CHAT_TEMPLATE --served-model-name $MODEL_NAME --max-num-batched-tokens $MAX_BATCHED_TOKENS --max-num-seqs $MAX_NUM_SEQS $QUANTIZATION_OPTION \
--tokenizer-pool-size 4 --enable-chunked-prefill"
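As an alternative to overriding the entrypoint on every run, the same fix could be baked into a derived image. This is only a sketch: it assumes the upstream image's default entrypoint and pip environment are otherwise left unchanged.

```dockerfile
# Hypothetical derived image that upgrades ray once at build time,
# so no entrypoint override is needed at `docker run` time.
FROM alpindale/aphrodite-openai:v0.6.1.post1
RUN pip install -U "ray[data,train,tune,serve]"
```

Build it once (e.g. `docker build -t aphrodite-openai-rayfix .`) and reference that tag in the `docker run` command above, dropping the `--entrypoint /bin/sh` and `-c "pip install ..."` parts.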
Your current environment
Docker isolated environment.
🐛 Describe the bug
Same issue as this one: https://github.com/vllm-project/vllm/issues/754