PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

Bad generation with GGUF and OpenAI api #319

Closed: ccdv-ai closed this issue 1 week ago

ccdv-ai commented 6 months ago

Hi

I tried to generate some text with a Mixtral Instruct GGUF model, but the model only produces nonsense. Something is wrong with either the tokenizer or the chat template. I also tried converting the model manually using this script, but I get the same behavior.

python -m aphrodite.endpoints.openai.api_server  \
    --model "mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf" \
    --tokenizer "mistralai/Mixtral-8x7B-Instruct-v0.1" \
    --quantization "gguf" \
    --port 8001 \
    --host 0.0.0.0 \
    --dtype "half" \
    --served-model-name mixtral \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --kv-cache-dtype auto \
    --seed 123 \
    --max-num-seqs 1 \
    --enforce-eager 
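A minimal sketch of the kind of chat request I send to the server (assuming it's reachable on localhost:8001; the prompt, max_tokens, and temperature are just placeholder values):

import requests

# Chat completion request against the OpenAI-compatible endpoint started above.
# "mixtral" matches --served-model-name; the prompt is only a placeholder.
resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "mixtral",
        "messages": [{"role": "user", "content": "Write one sentence about llamas."}],
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["message"]["content"])

The request itself succeeds; it's the generated text that comes back as nonsense.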

Edit: I'm using the pip package (v0.5.0).
Edit 2: Building from source leads to this error:

File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/vocab_parallel_embedding.py", line 123, in forward
    output_parallel = self.linear_method.apply_embedding(
  File "/home/user/.conda/envs/generation/lib/python3.10/site-packages/aphrodite/modeling/layers/quantization/gguf.py", line 152, in apply_embedding
    dequant = ops.ggml_dequantize(quant, weight_type, hidden_size,
RuntimeError: Unknown layout
AlpinDale commented 6 months ago

Can confirm this happens with Mixtral. Investigating.