Closed: SalomonKisters closed this issue 4 weeks ago
It seems you have another program on the same GPU. You need to lower -gmu to make room for it.
Thanks for your answer! I had actually limited my gmu to 0.8, and it worked with the full context length of 32k, but stopped working as soon as I limited it further. I still don't know exactly what caused the issue, but once I limited gmu to 0.7, loading the model with 4k context worked while 32k no longer did, so it seems to do what it should. I also didn't notice this with other models (Yi-based instead of Mistral, for example). Sorry about how unstructured this was, I will close it now :)
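The numbers in the reply above can be checked with back-of-the-envelope arithmetic, assuming -gmu caps the fraction of total VRAM the engine may claim (the flag's exact semantics are an assumption here; the card size and weight footprint are taken from the log below):

```python
# Sketch: how a gmu (GPU memory utilization) fraction caps the engine's
# memory budget. Figures come from the OOM message in this issue; the
# interpretation of -gmu as a fraction of total VRAM is assumed.

TOTAL_GPU_GIB = 23.69   # "GPU 0 has a total capacity of 23.69 GiB"
WEIGHTS_GIB = 3.88      # "Model weights loaded. Memory usage: 3.88 GiB"

def kv_cache_budget_gib(gmu: float) -> float:
    """Rough VRAM left for the KV cache after weights, under a gmu cap."""
    return gmu * TOTAL_GPU_GIB - WEIGHTS_GIB

print(f"gmu=0.8 -> ~{kv_cache_budget_gib(0.8):.1f} GiB for the KV cache")
print(f"gmu=0.7 -> ~{kv_cache_budget_gib(0.7):.1f} GiB for the KV cache")
```

Lowering gmu shrinks this budget, so a second process on the same GPU can fit; the trade-off is fewer cache blocks and therefore less headroom for long contexts.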
Your current environment
🐛 Describe the bug
2024-04-17 02:44:20,286 INFO worker.py:1752 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/app/TheBloke_OpenHermes-2.5-Mistral-7B-AWQ'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = awq
INFO: Context Length = 4000
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = fp8_e5m2
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: flash_attn is not found. Using xformers backend.
INFO: Model weights loaded. Memory usage: 3.88 GiB x 1 = 3.88 GiB
INFO: # GPU blocks: 17194, # CPU blocks: 4096
INFO: Minimum concurrency: 68.78x
INFO: Maximum sequence length allowed in the cache: 275104
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/aphrodite/executor/ray_gpu_executor.py", line 71, in __init__
    self._init_cache()
  File "/aphrodite/executor/ray_gpu_executor.py", line 267, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/aphrodite/task_handler/worker.py", line 158, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/aphrodite/task_handler/cache_engine.py", line 49, in __init__
    self.gpu_cache = self.allocate_gpu_cache()
  File "/aphrodite/task_handler/cache_engine.py", line 79, in allocate_gpu_cache
    value_blocks = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 59.62 MiB is free. Process 63325 has 21.52 GiB memory in use. Process 63324 has 2.11 GiB memory in use. Of the allocated memory 19.89 GiB is allocated by PyTorch, and 98.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
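The figures in the log are internally consistent, which suggests the allocator is behaving as intended. A rough sizing sketch using Mistral-7B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128); the 16-token block size is inferred from 275104 tokens / 17194 blocks, not stated in the log:

```python
# Rough KV-cache sizing for Mistral-7B with an fp8_e5m2 cache, checked
# against the log above. Architecture numbers are Mistral-7B's published
# config; the block size is inferred, not taken from the log.

NUM_LAYERS = 32      # Mistral-7B transformer layers
NUM_KV_HEADS = 8     # grouped-query attention: 8 KV heads
HEAD_DIM = 128
FP8_BYTES = 1        # fp8_e5m2 is one byte per element

# Key + value tensors, per token, summed across all layers:
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * FP8_BYTES
print(bytes_per_token)          # 65536 bytes = 64 KiB per token

gpu_blocks = 17194              # "# GPU blocks: 17194"
block_size = 275104 // gpu_blocks   # 16 tokens per block (inferred)
cache_gib = gpu_blocks * block_size * bytes_per_token / 2**30
print(f"{cache_gib:.1f} GiB")   # ~16.8 GiB of KV cache
```

Adding the 3.88 GiB of weights to that ~16.8 GiB cache accounts for most of the 21.52 GiB held by Process 63325, so the second process (2.11 GiB) is plausibly what pushed the card over the edge.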
Without that context length setting of 4000, everything works fine. The issue is that this will prevent me from running larger models.
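Separately, the OOM message itself suggests one mitigation worth trying when "reserved but unallocated" memory is large. A sketch of applying it, with the server module path taken from the traceback and the remaining flags elided:

```shell
# Suggested by the error message: enable expandable segments to reduce
# fragmentation-related allocation failures. Other launch flags omitted.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  python -m aphrodite.endpoints.openai.api_server ...
```

Here only ~98 MiB is reserved-but-unallocated, so this is unlikely to recover much on its own; lowering -gmu remains the main fix.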