PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: manually setting --max-model-len flag always leads to OOM, even if it is set very low #414

Closed SalomonKisters closed 4 weeks ago

SalomonKisters commented 1 month ago

Your current environment

Simply the Aphrodite engine built from source.

🐛 Describe the bug

2024-04-17 02:44:20,286 INFO worker.py:1752 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = '/app/TheBloke_OpenHermes-2.5-Mistral-7B-AWQ'
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = awq
INFO:     Context Length = 4000
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = fp8_e5m2
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     flash_attn is not found. Using xformers backend.
INFO:     Model weights loaded. Memory usage: 3.88 GiB x 1 = 3.88 GiB
INFO:     # GPU blocks: 17194, # CPU blocks: 4096
INFO:     Minimum concurrency: 68.78x
INFO:     Maximum sequence length allowed in the cache: 275104
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/aphrodite/executor/ray_gpu_executor.py", line 71, in __init__
    self._init_cache()
  File "/aphrodite/executor/ray_gpu_executor.py", line 267, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/aphrodite/task_handler/worker.py", line 158, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/aphrodite/task_handler/cache_engine.py", line 49, in __init__
    self.gpu_cache = self.allocate_gpu_cache()
  File "/aphrodite/task_handler/cache_engine.py", line 79, in allocate_gpu_cache
    value_blocks = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 59.62 MiB is free. Process 63325 has 21.52 GiB memory in use. Process 63324 has 2.11 GiB memory in use. Of the allocated memory 19.89 GiB is allocated by PyTorch, and 98.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Without setting --max-model-len to 4000, everything works fine. The issue is that not being able to lower it will keep me from running larger models.
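
For reference, a rough back-of-the-envelope check of the numbers in the log (a sketch only; the block size of 16 and Mistral-7B's 8 KV heads / 128 head dim / 32 layers are assumptions I filled in, not values taken from Aphrodite's code):

```python
# Rough sanity check of the failing allocation, using values from the log above.
# Assumed (not in the log): block_size=16, Mistral-7B GQA with 8 KV heads,
# head_dim=128, 32 layers; fp8_e5m2 is 1 byte per element.
num_gpu_blocks = 17194   # "# GPU blocks: 17194"
block_size = 16          # 17194 * 16 = 275104, matching the max cache length in the log
num_kv_heads = 8
head_dim = 128
num_layers = 32
dtype_bytes = 1          # fp8_e5m2

# One key (or value) cache tensor is allocated per layer:
per_layer_bytes = num_gpu_blocks * block_size * num_kv_heads * head_dim * dtype_bytes
print(f"per-layer K or V cache: {per_layer_bytes / 2**20:.0f} MiB")  # ~269 MiB, about the 270 MiB in the traceback

total_kv_gib = 2 * num_layers * per_layer_bytes / 2**30
print(f"total KV cache: {total_kv_gib:.1f} GiB")                     # ~16.8 GiB on top of 3.88 GiB of weights
```

If that arithmetic is right, the cache is sized from the GPU block count rather than from --max-model-len, and ~16.8 GiB of cache plus 3.88 GiB of weights plus the 2.11 GiB held by the other process in the traceback is consistent with the 23.69 GiB card running out, no matter how low the context limit is set.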

sgsdxzy commented 1 month ago

It seems you have another program on the same GPU. You need to lower -gmu to make room for it.
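
For anyone hitting the same thing, here is a hypothetical sketch of how the vLLM-style sizing that Aphrodite inherits budgets GPU blocks (the function and the activation estimate below are illustrative guesses, not Aphrodite's actual code): the KV cache gets whatever remains under the -gmu fraction of the card after weights and profiled activations, so lowering -gmu is what shrinks the cache; --max-model-len only caps sequence length.

```python
# Hypothetical sketch of vLLM-style GPU block budgeting; not Aphrodite's actual code.
def estimate_gpu_blocks(total_gib: float, gmu: float, weights_gib: float,
                        peak_activation_gib: float, block_mib: float = 1.0) -> int:
    """Blocks that fit under the gmu fraction of the card after weights and activations."""
    budget_gib = total_gib * gmu - weights_gib - peak_activation_gib
    return max(int(budget_gib * 1024 / block_mib), 0)

# A 16-token fp8 block of Mistral-7B KV cache is ~1 MiB (2 * 32 layers * 16 * 8 heads * 128 dim * 1 byte).
# On the 23.69 GiB card from the log, assuming ~0.5 GiB of profiled activations:
print(estimate_gpu_blocks(23.69, 0.9, 3.88, 0.5))  # ~17300 blocks, close to the logged 17194
print(estimate_gpu_blocks(23.69, 0.7, 3.88, 0.5))  # ~12500 blocks, leaving ~7 GiB of the card for other processes
```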

SalomonKisters commented 4 weeks ago

> It seems you have another program on the same GPU. You need to lower -gmu to make room for it.

Thanks for your answer! I had actually limited my gmu to 0.8, and it worked with the full 32k context length, but stopped working as soon as I limited the context. I still do not quite know what caused that, but once I lowered gmu to 0.7, loading the model with 4k context worked and 32k no longer did, so it seems to be doing what it should. I also did not notice this with other models (e.g. Yi-based instead of Mistral). Sorry about how unstructured this was, I will close it now :)