PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: Error from vllm when trying to load a quant model from docker #753

Open · puppetm4st3r opened this issue 4 hours ago

puppetm4st3r commented 4 hours ago

Your current environment

An isolated Docker environment.

🐛 Describe the bug

Same issue as this vLLM report: https://github.com/vllm-project/vllm/issues/754

puppetm4st3r commented 4 hours ago

Workaround: upgrade Ray inside the running container:

docker exec -it cont_name pip install -U "ray[data,train,tune,serve]"
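To confirm the upgrade took effect, a quick check (a sketch, assuming the same cont_name container is still running):

docker exec -it cont_name python3 -c "import ray; print(ray.__version__)"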

Or override the entrypoint, like this:

docker run --name "llm_server_$SELECTED_PORT" --user root --runtime nvidia $GPU_OPTION --shm-size 10g --ipc=host \
  -e HF_HUB_ENABLE_HF_TRANSFER="true" \
  -e RAY_DEDUP_LOGS=1 \
  -e HF_HOME="/data" \
  -p "$SELECTED_PORT":2242 -v /media/dario/work/dm/dolf/server/data_aphro/cache:/app/aphrodite-engine/.cache -v "$(pwd)/data_aphro:/home/workspace" -v "$(pwd)/datafolder:/data" \
  --entrypoint /bin/sh \
  alpindale/aphrodite-openai:v0.6.1.post1 \
  -c "pip install -U 'ray[data,train,tune,serve]' && python3 -m aphrodite.endpoints.openai.api_server \
  --download-dir /data --kv-cache-dtype fp8_e5m2 --tensor-parallel-size $NUM_SHARD --model $MODEL_ID --dtype auto \
  --gpu-memory-utilization $GPU_MEMORY --max-model-len $MAX_TOTAL_TOKENS \
  --swap-space 2 --max-log-len 100000 --disable-log-requests --disable-custom-all-reduce --guided-decoding-backend lm-format-enforcer --api-keys 123 \
  --chat-template $CHAT_TEMPLATE --served-model-name $MODEL_NAME --max-num-batched-tokens $MAX_BATCHED_TOKENS --max-num-seqs $MAX_NUM_SEQS $QUANTIZATION_OPTION \
  --tokenizer-pool-size 4 --enable-chunked-prefill"
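
(Note: the inner quotes around the download dir were dropped above, since a quoted "/data" inside the already double-quoted -c string only works by accidental string concatenation.)

To make the fix persistent instead of patching a running container, the Ray upgrade can also be baked into a derived image. This is a sketch, not an official image; it assumes the same alpindale/aphrodite-openai:v0.6.1.post1 base used above:

# Dockerfile (sketch): bake the Ray upgrade into a derived image
FROM alpindale/aphrodite-openai:v0.6.1.post1

# Same package set as the docker exec workaround above
RUN pip install -U "ray[data,train,tune,serve]"

# The base image's ENTRYPOINT/CMD are inherited unchanged

Build it (the tag aphrodite-openai-rayfix is just an example) with docker build -t aphrodite-openai-rayfix . and use that tag in the docker run command above, dropping the --entrypoint /bin/sh override and the pip install prefix.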