heungson opened 1 month ago
It is recommended to use exl2, gptq or awq over gguf. The support for gguf (especially sharded gguf) is unfinished.
Oh I see. Thank you for the reply!
Experimental support for multiple GGUF files has been added to the dev branch; please test whether it works according to the documentation.
@sgsdxzy Thank you for the update. I tried testing the 'dev' branch, but while the documentation says
The dev branch extends support for GGUF to all available model architectures besides LLAMA, and sharded (multiple-file) GGUF.
the code still contains a check that contradicts what the documentation says:

    # Only support llama so far
    if architecture != "llama":
        raise RuntimeError(f"Unsupported architecture {architecture}, "
                           "only llama is supported.")
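As a quick sanity check, here is a minimal sketch of how to read the general.architecture key that this check tests. It assumes the `gguf` pip package (from the llama.cpp project); the parts/data indexing follows GGUFReader's layout for string fields, and the file name is a placeholder:

```python
# Minimal sketch: read general.architecture from a GGUF file.
# Assumes the `gguf` pip package; the parts/data indexing below
# follows GGUFReader's layout for string-valued metadata fields.
from gguf import GGUFReader

def gguf_architecture(path: str) -> str:
    field = GGUFReader(path).fields["general.architecture"]
    return bytes(field.parts[field.data[0]]).decode("utf-8")

# Prints "llama" for Llama-family models; Command R+ files report a
# different architecture string, so they trip the check above.
print(gguf_architecture("model-00001-of-00002.gguf"))  # placeholder path
```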
Thus, when I tried to run the model 'dranger003/c4ai-command-r-plus-iMat.GGUF', it raised that error. With the Llama 3 model, it raises a different error:
(/home/lhs1012/.conda/aphrodite-runtime) lhs1012@ubuntu:/mnt3/lhs1012/laboratory/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server --model /mnt3/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-70B-Instruct-GGUF-v2/snapshots/7549d4063b18c5b0eb91e547a633245ee8fc4cdd/Meta-Llama-3-70B-Instruct-v2.Q5_1-00001-of-00002.gguf --enforce-eager true --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --quantization gguf
INFO: Extracting config from GGUF...
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-05-10 11:09:38,502 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/mnt3/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-70B-Instruct-GGUF-v2/snapshots/7549d4063b18c5b0eb91e547a633245ee8fc4cdd/Meta-Llama-3-70B-Instruct-v2.Q5_1-00001-of-00002.gguf'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO: Converting tokenizer from GGUF...
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
File "
Support for sharded GGUFs (you are using 00001-of-00002) and other architectures requires a pre-conversion step. You also need to point --model to the directory containing all the GGUF shards, not a single one.
The model must be of LlamaForCausalLM architecture to be loaded directly from GGUF; otherwise the original config.json and other JSON configs must be present in the directory. The tokenizer must be of LlamaTokenizer architecture to be loaded directly from GGUF; otherwise the original tokenizer must be present in the directory, or you can optionally use --tokenizer to choose another tokenizer.
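For example, one way to get those JSON files next to the shards is a filtered download from the original repo. This is a sketch assuming huggingface_hub; the local directory below is a placeholder for wherever your shards live:

```python
# Sketch: pull only the config/tokenizer JSONs of the original
# (unquantized) repo into the directory holding the GGUF shards.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="CohereForAI/c4ai-command-r-plus",  # original unquantized repo
    allow_patterns=["*.json", "tokenizer*"],    # config.json, tokenizer files
    local_dir="/path/to/gguf-shards",           # placeholder: your shard dir
)
```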
Llama 3 doesn't use LlamaTokenizer.
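It ships a BPE (tiktoken-style) vocabulary exposed through PreTrainedTokenizerFast rather than the SentencePiece-based LlamaTokenizer, which is easy to confirm (the repo is gated, so this assumes an accepted license and a valid HF token):

```python
# Confirm Llama 3's tokenizer class; requires access to the gated repo.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
print(type(tok).__name__)  # PreTrainedTokenizerFast, not LlamaTokenizer
```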
It succeeded in converting, but I got this error when running the model:
aphrodite run /mnt3/.cache/huggingface/hub/models--command-r-plus-gguf -tp 2
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-05-13 07:56:52,525 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/mnt3/.cache/huggingface/hub/models--command-r-plus-gguf'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 131072
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: The tokenizer's vocabulary size 255029 does not match the model's vocabulary size 256000.
INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
(RayWorkerAphrodite pid=2075297) INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerAphrodite pid=2075297) INFO: Using XFormers backend.
INFO: Aphrodite is using nccl==2.21.5
(RayWorkerAphrodite pid=2075297) INFO: Aphrodite is using nccl==2.21.5
INFO: reading GPU P2P access cache from /home/lhs1012/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=2075297) INFO: reading GPU P2P access cache from /home/lhs1012/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=2075297) WARNING: GGUF tensor name for lm_head.weight not found, this is normal if the model uses tie word embeddings.
(RayWorkerAphrodite pid=2075297) Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
(RayWorkerAphrodite pid=2075297) ERROR: Error executing method load_model. This might cause deadlock in distributed execution.
WARNING: GGUF tensor name for lm_head.weight not found, this is normal if the model uses tie word embeddings.
Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Traceback (most recent call last):
File "/home/lhs1012/.conda/aphrodite-runtime/bin/aphrodite", line 8, in
Can you test with the latest release, v0.5.3, and see if the issue still persists?
Still the same error with v0.5.3, and also on the current main branch.
🐛 Describe the bug
I might be missing something, but at the beginning of converting the GGUF to a PyTorch state_dict, it fails to find the layer 'blk.0.ffn_gate_exps' in the 'mapping' dictionary (see the sketch after the traceback below):
I have no name!@a535e478c460:/tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF$ python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 7860 --download-dir /tmp/hub --model /tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF/snapshots/e382348c70b7cbadc126025a60c2c9f7445fcddc/WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf --dtype auto --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --quantization gguf --enforce-eager --trust-remote-code
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-16 01:16:53,309 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/tmp/hub/models--MaziyarPanahi--WizardLM-2-8x22B-GGUF/snapshots/e382348c70b7cbadc126025a60c2c9f7445fcddc/WizardLM-2-8x22B.IQ3_XS-00001-of-00005.gguf'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gguf
INFO: Context Length = 32768
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Converting GGUF tensors to PyTorch... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1% 1/128 -:--:--
(RayWorkerAphrodite pid=1148) Converting GGUF tensors to PyTorch... 1% 1/128 -:--:--
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in init
self.engine = self._init_engine(*args, *kwargs)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in init
self._init_workers_ray(placement_group)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 283, in _init_workers_ray
self._run_workers(
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 112, in load_model
self.model_runner.load_model()
File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 121, in load_model
self.model = get_model(self.model_config, self.device_config,
File "/app/aphrodite-engine/aphrodite/modeling/loader.py", line 91, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/app/aphrodite-engine/aphrodite/modeling/models/mixtral_quant.py", line 450, in load_weights
for name, loaded_weight in hf_model_weights_iterator(
File "/app/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 293, in hf_model_weights_iterator
for name, param in convert_gguf_to_state_dict(model_name_or_path,
File "/app/aphrodite-engine/aphrodite/modeling/hf_downloader.py", line 271, in convert_gguf_to_state_dict
new_key, output_dim = mapping[layer]
KeyError: 'blk.0.ffn_gate_exps'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
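For context on the KeyError: the converter builds a dict from GGUF tensor names to HF parameter names and indexes it directly, so any name it has no entry for, such as the fused MoE expert tensor blk.0.ffn_gate_exps, fails at the lookup. Here is a minimal sketch of the failing pattern; the mapping below is a toy stand-in, not Aphrodite's actual table:

```python
# Toy stand-in for the GGUF->HF name map in convert_gguf_to_state_dict:
# it covers dense FFN tensors but not the fused MoE expert tensors
# (blk.N.ffn_gate_exps / ffn_up_exps / ffn_down_exps) that MoE GGUFs use.
mapping = {
    "blk.0.ffn_gate": ("model.layers.0.mlp.gate_proj.weight", 0),
    "blk.0.ffn_up": ("model.layers.0.mlp.up_proj.weight", 0),
}

layer = "blk.0.ffn_gate_exps"
try:
    new_key, output_dim = mapping[layer]  # the line that raises above
except KeyError:
    raise KeyError(f"No HF mapping for GGUF tensor {layer!r}; fused MoE "
                   "expert tensors are not handled by this converter.") from None
```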