Star-98 opened this issue 1 week ago
A 70B model at --dtype float32 takes ~280 GB of VRAM for the weights alone, in addition to the context. 96 GB won't cut it.
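Rough arithmetic for the weights alone (KV cache and activations come on top); a minimal sketch:

# Weight memory ~= parameter count x bytes per element; KV cache and
# activation memory are extra on top of this.
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 4))  # float32 -> ~280 GB
print(weight_memory_gb(70e9, 2))  # float16 -> ~140 GB
print(weight_memory_gb(8e9, 2))   # an 8B model at float16 -> ~16 GB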
Thank you. But I get the same error with float16 and with other, smaller models, so a different solution is needed.
Can you list one small model that has the same error?
This one: NeverSleep/Llama-3-Lumimaid-8B-v0.1
Below is the log generated when executing the 8B model.
(RayWorkerAphrodite pid=5826) INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerAphrodite pid=5826) INFO: Using XFormers backend.
INFO: Aphrodite is using nccl==2.20.5
(RayWorkerAphrodite pid=5826) INFO: Aphrodite is using nccl==2.20.5
INFO: NVLink detection failed with message "Not Supported". This is normal
if your machine has no NVLink equipped
WARNING: Custom allreduce is disabled because it's not supported on more than
two PCIe-only GPUs. To silence this warning, specify
disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=5826) INFO: NVLink detection failed with message "Not Supported". This is normal
(RayWorkerAphrodite pid=5826) if your machine has no NVLink equipped
(RayWorkerAphrodite pid=5826) WARNING: Custom allreduce is disabled because it's not supported on more than
(RayWorkerAphrodite pid=5826) two PCIe-only GPUs. To silence this warning, specify
(RayWorkerAphrodite pid=5826) disable_custom_all_reduce=True explicitly.
INFO: Using model weights format ['*.safetensors']
model-00001-of-00002.safetensors:   0%|            | 0.00/9.95G [00:00<?, ?B/s]
(RayWorkerAphrodite pid=5826) INFO: Using model weights format ['*.safetensors']
model-00002-of-00002.safetensors: 100%|████| 6.11G/6.11G [03:47<00:00, 26.9MB/s]
model-00001-of-00002.safetensors: 100%|████| 9.95G/9.95G [04:45<00:00, 34.9MB/s]
INFO: Model weights loaded. Memory usage: 3.74 GiB x 4 = 14.97 GiB
(RayWorkerAphrodite pid=5826) INFO: Model weights loaded. Memory usage: 3.74 GiB x 4 = 14.97 GiB
(RayWorkerAphrodite pid=5959) INFO: Cannot use FlashAttention backend for Volta and Turing GPUs. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerAphrodite pid=5959) INFO: Using XFormers backend. [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) INFO: Aphrodite is using nccl==2.20.5 [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) INFO: NVLink detection failed with message "Not Supported". This is normal [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) if your machine has no NVLink equipped [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) WARNING: Custom allreduce is disabled because it's not supported on more than [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) two PCIe-only GPUs. To silence this warning, specify [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerAphrodite pid=5959) INFO: Using model weights format ['*.safetensors'] [repeated 2x across cluster]
(RayWorkerAphrodite pid=5826) ERROR: Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerAphrodite pid=5959) INFO: Model weights loaded. Memory usage: 3.74 GiB x 4 = 14.97 GiB [repeated 2x across cluster]
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
[rank0]: run_server(args)
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
[rank0]: engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
[rank0]: engine = cls(engine_config.parallel_config.worker_use_ray,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 142, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 182, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 208, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/executor/ray_gpu_executor.py", line 325, in _run_workers
[rank0]: ray_worker_outputs = ray.get(ray_worker_outputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/ray/_private/worker.py", line 2630, in get
[rank0]: values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/ray/_private/worker.py", line 863, in get_objects
[rank0]: raise value.as_instanceof_cause()
[rank0]: ray.exceptions.RayTaskError(AssertionError): ray::RayWorkerAphrodite.execute_method() (pid=5826, ip=192.168.0.105, actor_id=8c8ad58e8edf8246105f3c4b01000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f3136c36290>)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/ray_tools.py", line 43, in execute_method
[rank0]: raise e
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/ray_tools.py", line 36, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/task_handler/worker.py", line 153, in determine_num_available_blocks
[rank0]: assert peak_memory > 0, (
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing Aphrodite.
(RayWorkerAphrodite pid=5959) ERROR: Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution. [repeated 2x across cluster]
@AlpinDale any idea?
Looks like it's having trouble during the memory profiling stage. Can you try a GGUF model?
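Also worth checking that nothing is still holding GPU memory from a previous run before launching; a minimal sketch (assuming PyTorch is available in the environment):

import torch

# Print free/total memory for each GPU. A GPU showing little free memory
# while nothing should be running usually means a stale process is still
# holding an allocation (nvidia-smi will show the offending PID).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")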
Automatic download did not work, so I downloaded it manually. Other than that it is the same.
python -m aphrodite.endpoints.openai.api_server --model /home/star_/models/bartowski_L3-8B-Stheno-v3.2-Q8_0L3-8B-Stheno-v3.2-Q8_0/L3-8B-Stheno-v3.2-Q8_0.gguf \
--dtype float16 \
--worker-use-ray \
--tensor-parallel-size 4 \
--disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--context-shift \
--swap-space 8 \
--gpu-memory-utilization 0.98 \
--device cuda \
INFO: Extracting config from GGUF...
WARNING: gguf quantization is not fully optimized yet. The speed can be slower than non-quantized
models.
INFO: CUDA_HOME is not found in the environment. Using /usr/local/cuda as CUDA_HOME.
INFO: Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the
performance. But it may cause slight accuracy drop without scaling factors. FP8_E5M2 (without
scaling) is only supported on cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead
supported for common inference criteria.
WARNING: Possibly too large swap space. 32.00 GiB out of the 62.51 GiB total CPU memory is allocated
for the swap space.
2024-06-27 08:03:53,809 INFO worker.py:1770 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = '/home/star_/models/bartowski_L3-8B-Stheno-v3.2-Q8_0L3-8B-Stheno-v3.2-Q8_0/L3-8B-Stheno-v3.2-Q8_0.gguf'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 4
INFO: Disable Custom All-Reduce = True
INFO: Quantization Format = gguf
INFO: Context Length = 8192
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = fp8
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
WARNING: Possibly too large swap space. 32.00 GiB out of the 62.51 GiB total CPU memory is allocated
for the swap space.
INFO: Converting tokenizer from GGUF...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 562, in <module>
run_server(args)
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/endpoints/openai/api_server.py", line 519, in run_server
engine = AsyncAphrodite.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 358, in from_engine_args
engine = cls(engine_config.parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 323, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/async_aphrodite.py", line 429, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 125, in __init__
self._init_tokenizer()
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/engine/aphrodite_engine.py", line 246, in _init_tokenizer
self.tokenizer: BaseTokenizerGroup = get_tokenizer_group(
^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
return TokenizerGroup(**init_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/transformers_utils/tokenizer.py", line 136, in get_tokenizer
return convert_gguf_to_tokenizer(tokenizer_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/star_/.conda/envs/aphrodite/lib/python3.11/site-packages/aphrodite/transformers_utils/tokenizer.py", line 44, in convert_gguf_to_tokenizer
scores = result.fields['tokenizer.ggml.scores']
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'tokenizer.ggml.scores'
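The KeyError suggests this particular GGUF file does not carry a 'tokenizer.ggml.scores' field in its metadata. One way to check, as a sketch assuming the gguf Python package is installed:

from gguf import GGUFReader

reader = GGUFReader("/home/star_/models/bartowski_L3-8B-Stheno-v3.2-Q8_0L3-8B-Stheno-v3.2-Q8_0/L3-8B-Stheno-v3.2-Q8_0.gguf")
# List the tokenizer-related metadata keys; the converter in the traceback
# expects 'tokenizer.ggml.scores' to be among them.
for name in reader.fields:
    if name.startswith("tokenizer."):
        print(name)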
Does --context-shift work with --kv-cache-dtype fp8? Can you try removing one of them from the launch args, using NeverSleep/Llama-3-Lumimaid-8B-v0.1?
It failed. Also, if I remove all arguments and run it, it still fails. Something strange seems to be happening... I will try reinstalling Aphrodite again tomorrow.
python -m aphrodite.endpoints.openai.api_server --model /home/star_/models/bartowski_L3-8B-Stheno-v3.2-Q8_0L3-8B-Stheno-v3.2-Q8_0/L3-8B-Stheno-v3.2-Q8_0.gguf \
--gpu-memory-utilization 0.98 \
--device cuda \
Let me correct what I said. For regular models, not gguf, just remove --tensor-parallel-size and it will work.
Today I tried NeverSleep/Llama-3-Lumimaid-8B-v0.1. Since the P40 does not support bfloat16, I ran it with the --dtype float16 argument added.
If you remove --tensor-parallel-size, it seems to run normally.
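For reference, the invocation that would correspond to that working setup (reconstructed from the flags mentioned in this thread, not copied from a log):

python -m aphrodite.endpoints.openai.api_server --model NeverSleep/Llama-3-Lumimaid-8B-v0.1 \
  --dtype float16 \
  --gpu-memory-utilization 0.98 \
  --device cuda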
🐛 Describe the bug
An error occurs when trying to load a model across multiple GPUs. I have 96 GB of VRAM. Why does OOM occur?