Closed: heungson closed this issue 1 month ago.
Also getting this error for turboderp/command-r-plus-103B-exl2 on 2x4090s on Runpod (EDIT: and also Dracones/c4ai-command-r-v01_exl2_3.0bpw on 1x4090) with the latest official Aphrodite Docker image as of writing:
alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1
CMD_ADDITIONAL_ARGUMENTS="--model turboderp/command-r-plus-103B-exl2 --revision 3.0bpw --tokenizer-revision 3.0bpw --quantization exl2 --max-model-len 4096 --kv-cache-dtype fp8 --dtype float16 --enforce-eager true"
PORT=7860
HF_HUB_ENABLE_HF_TRANSFER=1
NUM_GPUS=2
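For what it's worth, here is a rough non-Runpod equivalent of the above, purely as a sketch: it assumes the official image honors the same CMD_ADDITIONAL_ARGUMENTS / PORT / NUM_GPUS environment variables when launched directly with docker run, and the port mapping and --shm-size value are my own guesses rather than part of the template.

```sh
# Sketch only: mirrors the Runpod template above on a plain Docker host,
# assuming the image reads the same env vars (CMD_ADDITIONAL_ARGUMENTS,
# PORT, NUM_GPUS). The -p mapping and --shm-size are assumptions.
docker run --gpus all --shm-size 16g -p 7860:7860 \
  -e CMD_ADDITIONAL_ARGUMENTS="--model turboderp/command-r-plus-103B-exl2 --revision 3.0bpw --tokenizer-revision 3.0bpw --quantization exl2 --max-model-len 4096 --kv-cache-dtype fp8 --dtype float16 --enforce-eager true" \
  -e PORT=7860 \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e NUM_GPUS=2 \
  alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1
```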
I wonder if these are related?
But the latest official Docker image should have that change:
So maybe not related. I tried setting the UID environment variable to 0 and 1000, and I tried --user=root as an additional Docker run arg, but I get the same error:
@AlpinDale Please ignore if this issue is a wontfix (and please forgive this ping in that case :pray:) -- just in case this slipped through the cracks: I can reproduce OP's issue. See my above comment for reproduction details + logs. The TL;DR is that command-r-plus doesn't seem to work with a basic Aphrodite setup (e.g. exl2 weights, Runpod w/ official docker image, as above).
Edit: I can also reproduce with Dracones/c4ai-command-r-v01_exl2_3.0bpw (i.e. the issue seems to occur with both command-r and command-r-plus).
I'll get to investigating this soon; I've been busy with other projects so I haven't had much time to work on aphrodite lately. I have an inkling that this is related to torch.compile().
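If torch.compile() is indeed the suspect, one cheap check (just a sketch, not a confirmed workaround) is to re-run the server with TorchDynamo disabled via PyTorch's standard TORCHDYNAMO_DISABLE switch; if the determine_num_available_blocks failure disappears, that would point at the compile path. The module path below is taken from the traceback in the logs, the flags mirror the repro settings above, and multi-GPU flags are omitted.

```sh
# Sketch: test the torch.compile() hypothesis by disabling TorchDynamo.
# TORCHDYNAMO_DISABLE=1 is a standard PyTorch switch that turns torch.compile
# into a no-op. Run inside the container (or wherever aphrodite is installed);
# add your usual multi-GPU flags as needed.
TORCHDYNAMO_DISABLE=1 python -m aphrodite.endpoints.openai.api_server \
  --model turboderp/command-r-plus-103B-exl2 \
  --revision 3.0bpw --tokenizer-revision 3.0bpw \
  --quantization exl2 --max-model-len 4096 \
  --kv-cache-dtype fp8 --dtype float16 \
  --enforce-eager true
```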
Your current environment
aphrodite docker container
Setting 1: GPUs: RTX8000 * 2, model: alpindale/c4ai-command-r-plus-GPTQ, quantization: gptq
Setting 2: GPUs: A6000 Ada * 4, model: CohereForAI/c4ai-command-r-plus, quantization: load-in-smooth
🐛 Describe the bug
Starting Aphrodite Engine API server...
`resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING: gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-05-17 02:21:49,653 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO: Model = 'alpindale/c4ai-command-r-plus-GPTQ'
INFO: Speculative Config = None
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = gptq
INFO: Context Length = 29000
INFO: Enforce Eager Mode = True
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: The tokenizer's vocabulary size 255029 does not match the model's vocabulary size 256000.
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
(RayWorkerAphrodite pid=1127) INFO: Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerAphrodite pid=1127) INFO: Using XFormers backend.
INFO: Aphrodite is using nccl==2.20.5
(RayWorkerAphrodite pid=1127) INFO: Aphrodite is using nccl==2.20.5
INFO: generating GPU P2P access cache for in /app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
INFO: reading GPU P2P access cache from /app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=1127) INFO: reading GPU P2P access cache from
(RayWorkerAphrodite pid=1127) /app/aphrodite-engine/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
(RayWorkerAphrodite pid=1127) INFO: Using model weights format ['.safetensors']
INFO: Using model weights format ['.safetensors']
INFO: Model weights loaded. Memory usage: 27.78 GiB x 2 = 55.55 GiB
rank0: Traceback (most recent call last):
rank0:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rank0:     return _run_code(code, main_globals, None,
rank0:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
rank0:     exec(code, run_globals)
rank0:   File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 562, in
[...]
rank0: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
rank0: You can suppress this exception and fall back to eager by setting:
rank0:   import torch._dynamo
rank0:   torch._dynamo.config.suppress_errors = True
(RayWorkerAphrodite pid=1127) INFO: Model weights loaded. Memory usage: 27.78 GiB x 2 = 55.55 GiB
(RayWorkerAphrodite pid=1127) ERROR: Error executing method determine_num_available_blocks. This might
(RayWorkerAphrodite pid=1127) cause deadlock in distributed execution.
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
This is the log generated with the GPTQ version. The same errors are raised when running the non-quantized version of the model. The GPTQ version works fine on vLLM.
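Since the interesting part of the traceback is cut off, it may help to re-run with the verbose Dynamo logging that the error message itself suggests and attach the full output. Below is a sketch for the GPTQ setting: the two environment variables are the ones named in the log, the model and flags mirror Setting 1, multi-GPU flags are omitted, and the log file path is arbitrary.

```sh
# Sketch: capture the full Dynamo error hidden by the truncated traceback.
# TORCH_LOGS and TORCHDYNAMO_VERBOSE are the variables the error message
# itself recommends; model and flags mirror Setting 1 from the report.
TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1 \
  python -m aphrodite.endpoints.openai.api_server \
  --model alpindale/c4ai-command-r-plus-GPTQ \
  --quantization gptq --max-model-len 29000 \
  --dtype float16 --enforce-eager true \
  2>&1 | tee dynamo-verbose.log
```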