PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

Is GGUF support broken? #281

Closed davideuler closed 1 month ago

davideuler commented 7 months ago

I tried to start the service with a GGUF model on an RTX 4090 to test GGUF performance. It shows an error, and I am not sure whether GGUF support is broken. I start the service with this command:

python -m aphrodite.endpoints.openai.api_server  --model Mixtral_11Bx2_MoE_19B-GGUF/ --quantization gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral  --disable-log-requests --gpu-memory-utilization 0.8

Error message:

  File "miniconda3/envs/fast-llm-serving/lib/python3.10/site-packages/aphrodite/common/config.py", line 136, in _verify_load_format
    if "MixtralForCausalLM" in architectures and load_format == "pt":
TypeError: argument of type 'NoneType' is not iterable

AlpinDale commented 7 months ago

Are you on the latest release or building from source?

Aphrodite now accepts direct .gguf files, so please don't convert if you're building from source.

AlpinDale commented 7 months ago

Can confirm that I can run the same model (Q4_K_M) on the latest build by supplying the .gguf file.
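For reference, the invocation is roughly of this shape (the path is just a placeholder for wherever your .gguf file lives):

python -m aphrodite.endpoints.openai.api_server --model /path/to/model.Q4_K_M.gguf -q gguf --port 5000 --served-model-name mixtral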

davideuler commented 7 months ago

> Are you on the latest release or building from source?
>
> Aphrodite now accepts direct .gguf files, so please don't convert if you're building from source.

I am on the latest release, v0.4.9. I'll try building from the current source and running it.

davideuler commented 7 months ago

I've built the package from the latest source, and GGUF models still do not work on my RTX 4090.

For GPTQ and AWQ models it works like a charm. Aphrodite is the fastest inference engine in my tests, across dozens of engines. Thanks for the great work.

Is there anything I missed for GGUF models? Do I need to prepare a special config.json or pass other parameters for GGUF?

I start the model from a local directory with:

python -m aphrodite.endpoints.openai.api_server  --model LHK_DPO_v1_GGUF/ -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral  --disable-log-requests --gpu-memory-utilization 0.8

The error shows: OSError: Can't load tokenizer for 'LHK_DPO_v1_GGUF/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'LHK_DPO_v1_GGUF/' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

I downloaded the GGUF model from here: https://huggingface.co/owao/LHK_DPO_v1_GGUF/

Here is the folder structure:

$ ls LHK_DPO_v1_GGUF/
config.json  LHK_DPO_v1_Q8_0.gguf  README.md

Could you show the command you use to start a GGUF model on your machine, and the directory structure you use for GGUF models? Thanks.

davideuler commented 7 months ago

I've got the GGUF model to load, but it still fails with an out-of-memory error on an RTX 4090 with 24 GB of VRAM.

python -m aphrodite.endpoints.openai.api_server  --model LHK_DPO_v1_GGUF/LHK_DPO_v1_Q8_0.gguf -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral --gpu-memory-utilization 0.9 --max-model-len 2048

The model file is 13 GB. I wonder why it still OOMs when the VRAM is 24 GB, far larger than the model file.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 39.81 MiB is free. Including non-PyTorch memory, this process has 23.60 GiB memory in use. Of the allocated memory 22.74 GiB is allocated by PyTorch, and 182.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

davideuler commented 7 months ago

I finally got a much smaller model to work. However, the inference result is not the desired answer; it seems the GGUF model did not understand the user's prompt. The same model works really well on an M1 Ultra.

python -m aphrodite.endpoints.openai.api_server  --model LHK_DPO_v1_GGUF/LHK_DPO_v1_Q8_0.gguf -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral --gpu-memory-utilization 0.7 --max-model-len 5120 --kv-cache-dtype auto

AlpinDale commented 7 months ago

Sorry for the late response, @davideuler

Please set --enforce-eager as well to save memory, since CUDA graph capture doesn't seem to play very well with GGUF models. If you're still experiencing issues, you may also want to set --kv-cache-dtype fp8_e5m2 for more memory savings.
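Concretely, building on your last command, it would look something like this (keep your own paths and limits):

python -m aphrodite.endpoints.openai.api_server --model LHK_DPO_v1_GGUF/LHK_DPO_v1_Q8_0.gguf -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral --gpu-memory-utilization 0.7 --max-model-len 5120 --enforce-eager --kv-cache-dtype fp8_e5m2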

As for the inconsistent results, can you compare generations with all samplers disabled and temperature=0?

curl -X POST http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Once upon a time, ",
"model": "mixtral",
"temperature": 0.0,
"max_tokens": 128
}' | jq .

Compare the generated output against what you get on your M1 Ultra with llama.cpp, and make sure the prompt matches.
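For the llama.cpp side on the M1 Ultra, a greedy run would look roughly like this (binary name and flags may differ depending on your llama.cpp version):

./main -m LHK_DPO_v1_GGUF/LHK_DPO_v1_Q8_0.gguf -p "Once upon a time, " -n 128 --temp 0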

davideuler commented 7 months ago

> I finally got a much smaller model to work. However, the inference result is not the desired answer; it seems the GGUF model did not understand the user's prompt. The same model works really well on an M1 Ultra.
>
> python -m aphrodite.endpoints.openai.api_server  --model LHK_DPO_v1_GGUF/LHK_DPO_v1_Q8_0.gguf -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral --gpu-memory-utilization 0.7 --max-model-len 5120 --kv-cache-dtype auto

I've confirmed that the command does not work with the release version, v0.4.9.

davideuler commented 7 months ago

> Sorry for the late response, @davideuler
>
> Please set --enforce-eager as well to save memory, since CUDA graph capture doesn't seem to play very well with GGUF models. If you're still experiencing issues, you may also want to set --kv-cache-dtype fp8_e5m2 for more memory savings.
>
> As for the inconsistent results, can you compare generations with all samplers disabled and temperature=0?
>
> curl -X POST http://localhost:2242/v1/completions \
> -H "Content-Type: application/json" \
> -d '{
> "prompt": "Once upon a time, ",
> "model": "mixtral",
> "temperature": 0.0,
> "max_tokens": 128
> }' | jq .
>
> Compare the generated output against what you get on your M1 Ultra with llama.cpp, and make sure the prompt matches.

That's OK, thanks for the comment. --enforce-eager really helps; I can start the Mixtral_11Bx2_MoE_19B-GGUF Q8 model with that parameter on a 24 GB RTX 4090.
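For reference, the working invocation is roughly the following (the exact .gguf filename is a placeholder for whatever is inside the model folder):

python -m aphrodite.endpoints.openai.api_server --model Mixtral_11Bx2_MoE_19B-GGUF/<q8_0-file>.gguf -q gguf --port 5000 --host 0.0.0.0 --served-model-name mixtral --gpu-memory-utilization 0.8 --enforce-eager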

And the command works as you showed. However, chat-ui does not work with the API service started by Aphrodite for GGUF models; there seems to be some bug in chat-ui. For GPTQ/AWQ models served by the Aphrodite engine, it works. I'm not sure what the problem is; I'm still troubleshooting.
