SalomonKisters closed this issue 3 weeks ago.
Just got another error, this time using WizardLM-2-8x22b (a Mixtral finetune):
INFO: Model weights loaded. Memory usage: 17.20 GiB x 4 = 68.78 GiB
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/aphrodite/endpoints/openai/api_server.py", line 621, in
Are you using the main branch or dev branch? Can you try on the dev branch to see if the problem persists?
Main branch, but building from source. I'll try using dev and update you on the results :) Might take a while though, as the build from source takes like an hour for me for some reason.
Are you using the main branch or dev branch? Can you try on the dev branch to see if the problem persists?
I tried to do it with dev, but a library that is used by aphrodite was not installed and so it failed. This is the code I am using in my docker image:
(I tried with git checkout dev as well)
RUN git clone https://github.com/PygmalionAI/aphrodite-engine.git /tmp/aphrodite-engine && \
    mv /tmp/aphrodite-engine/* . && \
    rm -rf /tmp/aphrodite-engine && \
    chmod +x docker/entrypoint.sh && \
    python3 -m venv $APHRODITE_VENV && \
    /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    python3 -m pip install --no-cache-dir -e . && \
    deactivate"
This is the error trace installing dev from source, when I try to start the engine up:

INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/app/aphrodite-engine/aphrodite/modeling/megatron/cupy_utils.py", line 14, in <module>
    import cupy
ModuleNotFoundError: No module named 'cupy'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 619, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 68, in __init__
    self._init_workers_ray(placement_group)
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 205, in _init_workers_ray
    self._run_workers("init_model",
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 102, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 286, in init_distributed_environment
    cupy_utils.init_process_group(
  File "/app/aphrodite-engine/aphrodite/modeling/megatron/cupy_utils.py", line 76, in init_process_group
    raise ImportError(
ImportError: NCCLBackend is not available. Please install cupy.
(RayWorkerAphrodite pid=2900) ERROR: Error executing method init_model. This might cause deadlock in
(RayWorkerAphrodite pid=2900) distributed execution.
(RayWorkerAphrodite pid=3073) ERROR: Error executing method init_model. This might cause deadlock in [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerAphrodite pid=3073) distributed execution. [repeated 2x across cluster]
Update: running pip install cupy-cuda12x in the instance seemed to fix it, will retry later.
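A quick way to confirm the interim fix took effect inside the venv is just retrying the import that was failing above; the version print is only illustrative:

import cupy                 # previously raised ModuleNotFoundError: No module named 'cupy'
print(cupy.__version__)     # succeeds once cupy-cuda12x is installed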
I retried with the fix and it merely changed the issue:
WARNING: Admin key not provided. Admin operations will be disabled.
WARNING: awq quantization is not fully optimized yet. The speed can be slower
than non-quantized models.
INFO: Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory
footprint and boosts the performance. But it may cause slight accuracy drop.
Currently we only support fp8 without scaling factors and use e5m2 as a default
format.
2024-04-25 12:08:54,673 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1
as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1
to mute this warning.
2024-04-25 12:08:54,674 WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 26.87999 to 26.
2024-04-25 12:08:54,846 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/app/casperhansen_llama-3-8b-instruct-awq-main'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = awq
INFO: Context Length = 8000
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = fp8_e5m2
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: flash_attn is not found. Using xformers backend.
INFO: Model weights loaded. Memory usage: 5.34 GiB x 1 = 5.34 GiB
INFO: # GPU blocks: 61048, # CPU blocks: 4096
INFO: Minimum concurrency: 122.10x
INFO: Maximum sequence length allowed in the cache: 976768
INFO: Capturing the model for CUDA graphs. This may lead to unexpected
consequences if the model is not static. To run the model in eager mode, set
'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are
running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode.
Capturing graph... 0% 0/35 -:--:--
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 619, in
Invoked with: tensor([[[ 3.0820, 2.7754, 2.5762, ..., 0.1114, -0.8169, -0.8433], [-0.1935, -0.5439, 0.4592, ..., 1.0117, -1.0430, -0.5796], [-3.0254, 5.1680, 1.0449, ..., -0.2295, -0.6177, 1.9072], ..., [-0.3259, 0.5840, 0.2529, ..., 1.0391, -0.6431, -0.9507], [-0.2727, -1.0410, 0.4785, ..., -1.1104, 0.7637, -0.9116], [-0.1769, -0.1261, -0.3979, ..., -0.0826, -1.0986, -2.8613]],
...,
[[ 3.0820, 2.7754, 2.5762, ..., 0.1114, -0.8169, -0.8433],
[-0.1935, -0.5439, 0.4592, ..., 1.0117, -1.0430, -0.5796],
[-3.0254, 5.1680, 1.0449, ..., -0.2295, -0.6177, 1.9072],
...,
[-0.3259, 0.5840, 0.2529, ..., 1.0391, -0.6431, -0.9507],
[-0.2727, -1.0410, 0.4785, ..., -1.1104, 0.7637, -0.9116],
[-0.1769, -0.1261, -0.3979, ..., -0.0826, -1.0986, -2.8613]]],
device='cuda:0', dtype=torch.float16), tensor([[[-0.0085, -0.0052, 0.0432, ..., 0.0155, -0.0157, 0.0518],
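As an aside, the cache numbers in the startup log above are consistent with each other; a small check, assuming the usual 16-token KV-cache block size (the block size itself is not printed in the log):

gpu_blocks = 61048         # "# GPU blocks" from the log
block_size = 16            # assumed default tokens per KV-cache block
context_length = 8000      # "Context Length" from the log

max_cached_tokens = gpu_blocks * block_size
print(max_cached_tokens)                   # 976768 -> "Maximum sequence length allowed in the cache"
print(max_cached_tokens / context_length)  # ~122.1 -> "Minimum concurrency: 122.10x"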
Did you recompile (execute pip install -e .) aphrodite after switching to dev? You need to recompile it.
Yeah I am building a docker image, and I rebuilt it entirely, also incorporating the fix I found. It also still works on main.
This is the code, using nvidia/cuda:12.1.1-devel-ubuntu22.04 as base:
RUN git clone https://github.com/PygmalionAI/aphrodite-engine.git /app/aphrodite-engine && \
    cd /app/aphrodite-engine && \
    git checkout dev && \
    chmod +x docker/entrypoint.sh && \
    python3 -m venv $APHRODITE_VENV && \
    /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    python3 -m pip install --no-cache-dir -e . && \
    deactivate"

RUN /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    pip install cupy-cuda12x && \
    deactivate"
Thank you for your help
Found the issue in this line: --kv-cache-dtype "fp8_e5m2". This value is no longer supported and has been replaced by fp8 on dev.
Okay, I finally got it running, thanks for the help
🐛 Describe the bug
Description
Complete error message:
File "/aphrodite/task_handler/model_runner.py", line 765, in profile_run self.execute_model(seqs, kv_caches) File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/aphrodite/task_handler/model_runner.py", line 700, in execute_model hidden_states = model_executable( File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 414, in forward hidden_states = self.model(input_ids, positions, kv_caches, File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward hidden_states, residual = layer(positions, hidden_states, File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward hidden_states = self.block_sparse_moe(hidden_states) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 154, in forward batch_size, sequence_length, hidden_dim = hidden_states.shape ValueError: not enough values to unpack (expected 3, got 2) (RayWorkerAphrodite pid=37067) ERROR: Error executing method profile_num_available_blocks. This might cause (RayWorkerAphrodite pid=37067) deadlock in distributed execution. (RayWorkerAphrodite pid=37260) INFO: Model weights loaded. Memory usage: 17.20 GiB x 4 = 68.80 GiB [repeated 2x across cluster] (RayWorkerAphrodite pid=37260) ERROR: Error executing method profile_num_available_blocks. This might cause [repeated 2x across cluster] (RayWorkerAphrodite pid=37260) deadlock in distributed execution. [repeated 2x across cluster]
Thank you for your help!