deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Error executing method determine_num_available_blocks: vLLM multi node fails for both DeepSeek-Coder-V2-Instruct and DeepSeek-Coder-V2-Lite-Instruct #76

Open · liangfang opened this issue 4 months ago

liangfang commented 4 months ago

First, I'd like to ask: has DeepSeek tried running this on vLLM multi-node? I am running it via Ray on 2 nodes x 8 V100 GPUs in half precision (float16).

These are the launch arguments:

```
CUDA_LAUNCH_BLOCKING=1 OMP_NUM_THREADS=1 vllm serve deepseek-ai/DeepSeek-Coder-V2-Instruct --tensor-parallel-size 16 --dtype half --trust-remote-code --enforce-eager --enable-chunked-prefill=False
```
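To rule out the OpenAI API-server layer, the same engine configuration can also be driven from the offline Python API. The sketch below is only hypothetical: it assumes vLLM 0.5.3.post1, that the Ray cluster spanning the two nodes is already up, and the prompt and sampling settings are placeholders.

```python
# Hypothetical offline-inference repro with the same engine settings as `vllm serve`.
# Assumes the Ray cluster is already running (ray start --head on the head node,
# ray start --address=<head_ip>:6379 on the second node); vLLM will then pick the
# Ray backend automatically for a 16-way tensor-parallel run across two nodes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct",
    tensor_parallel_size=16,   # 2 nodes x 8 V100s
    dtype="half",              # V100 has no native bfloat16
    trust_remote_code=True,
    enforce_eager=True,
    max_model_len=8192,
)
outputs = llm.generate(["def quicksort(arr):"],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```

If this also dies in determine_num_available_blocks, the API server itself can be excluded as the cause.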

DeepSeek-Coder-V2-Lite-Instruct also fails at determine_num_available_blocks, but it reports an NCCL error instead:

```
(RayWorkerWrapper pid=23558, ip=10.0.128.18) ERROR 07-28 13:53:40 worker_base.py:382] RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers
```
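Because the Lite model dies with NCCL Error 3 rather than a CUDA launch error, it may help to verify cross-node NCCL outside of vLLM first. Below is a hedged sanity-check sketch; the file name nccl_check.py and rendezvous port 29500 are arbitrary, and the head-node IP 10.0.128.17 is the one from the logs. It is launched with torchrun on both nodes.

```python
# nccl_check.py -- hypothetical NCCL sanity check, independent of vLLM.
# Launch on BOTH nodes (16 ranks total):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=10.0.128.17:29500 nccl_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # rendezvous info is provided by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                       # expect 16.0 on every rank
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()
```

The full console output of the DeepSeek-Coder-V2-Instruct run follows: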

```
root@g02-17:/vllm-workspace# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_LAUNCH_BLOCKING=1  OMP_NUM_THREADS=1 HF_ENDPOINT="https://hf-mirror.com" vllm serve deepseek-ai/DeepSeek-Coder-V2-Instruct --tensor-parallel-size 16 --dtype half --trust-remote-code --enforce-eager --enable-chunked-prefill=False --max-model-len 8192
INFO 07-28 13:54:35 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-28 13:54:35 api_server.py:220] args: Namespace(model_tag='deepseek-ai/DeepSeek-Coder-V2-Instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='deepseek-ai/DeepSeek-Coder-V2-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=16, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fa6bae68d30>)
WARNING 07-28 13:54:37 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 07-28 13:54:37 config.py:715] Defaulting to use ray for distributed inference
2024-07-28 13:54:37,131 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 10.0.128.17:6379...
2024-07-28 13:54:37,140 INFO worker.py:1788 -- Connected to Ray cluster.
INFO 07-28 13:54:38 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='deepseek-ai/DeepSeek-Coder-V2-Instruct', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-Coder-V2-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=16, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=deepseek-ai/DeepSeek-Coder-V2-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 13:55:28 selector.py:54] Using XFormers backend.
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:54] Using XFormers backend.
[W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2
INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=77233) INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=77233) INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 07-28 13:55:33 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='10.0.128.17', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fa3914c32b0>, local_subscribe_port=59601, local_sync_port=35331, remote_subscribe_port=53347, remote_sync_port=58569)
(RayWorkerWrapper pid=24524, ip=10.0.128.18) WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes.
INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct...
(RayWorkerWrapper pid=24524, ip=10.0.128.18) INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct...
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:28 selector.py:54] Using XFormers backend. [repeated 14x across cluster]
(RayWorkerWrapper pid=24524, ip=10.0.128.18) Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
INFO 07-28 13:55:33 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-28 13:55:33 selector.py:54] Using XFormers backend.
INFO 07-28 13:55:34 weight_utils.py:223] Using model weights format ['*.safetensors']
(RayWorkerWrapper pid=24600, ip=10.0.128.18) INFO 07-28 13:55:35 weight_utils.py:223] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/55 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/55 [00:01<01:00,  1.11s/it]
Loading safetensors checkpoint shards:   4% Completed | 2/55 [00:02<01:02,  1.18s/it]
Loading safetensors checkpoint shards:   5% Completed | 3/55 [00:03<01:01,  1.18s/it]
Loading safetensors checkpoint shards:   7% Completed | 4/55 [00:04<01:00,  1.18s/it]
Loading safetensors checkpoint shards:   9% Completed | 5/55 [00:05<00:59,  1.19s/it]
Loading safetensors checkpoint shards:  11% Completed | 6/55 [00:07<00:57,  1.17s/it]
Loading safetensors checkpoint shards:  13% Completed | 7/55 [00:08<00:56,  1.18s/it]
Loading safetensors checkpoint shards:  15% Completed | 8/55 [00:09<00:55,  1.18s/it]
Loading safetensors checkpoint shards:  16% Completed | 9/55 [00:10<00:54,  1.18s/it]
Loading safetensors checkpoint shards:  18% Completed | 10/55 [00:11<00:51,  1.15s/it]
Loading safetensors checkpoint shards:  20% Completed | 11/55 [00:12<00:50,  1.14s/it]
Loading safetensors checkpoint shards:  22% Completed | 12/55 [00:13<00:49,  1.14s/it]
Loading safetensors checkpoint shards:  24% Completed | 13/55 [00:14<00:46,  1.10s/it]
Loading safetensors checkpoint shards:  25% Completed | 14/55 [00:16<00:45,  1.10s/it]
Loading safetensors checkpoint shards:  27% Completed | 15/55 [00:17<00:44,  1.12s/it]
Loading safetensors checkpoint shards:  29% Completed | 16/55 [00:18<00:43,  1.12s/it]
Loading safetensors checkpoint shards:  31% Completed | 17/55 [00:19<00:43,  1.13s/it]
Loading safetensors checkpoint shards:  33% Completed | 18/55 [00:20<00:42,  1.15s/it]
Loading safetensors checkpoint shards:  35% Completed | 19/55 [00:21<00:41,  1.16s/it]
Loading safetensors checkpoint shards:  36% Completed | 20/55 [00:23<00:40,  1.16s/it]
Loading safetensors checkpoint shards:  38% Completed | 21/55 [00:24<00:39,  1.17s/it]
Loading safetensors checkpoint shards:  40% Completed | 22/55 [00:25<00:38,  1.16s/it]
Loading safetensors checkpoint shards:  42% Completed | 23/55 [00:26<00:36,  1.15s/it]
Loading safetensors checkpoint shards:  44% Completed | 24/55 [00:27<00:35,  1.15s/it]
Loading safetensors checkpoint shards:  45% Completed | 25/55 [00:28<00:34,  1.14s/it]
Loading safetensors checkpoint shards:  47% Completed | 26/55 [00:29<00:32,  1.12s/it]
Loading safetensors checkpoint shards:  49% Completed | 27/55 [00:30<00:31,  1.11s/it]
Loading safetensors checkpoint shards:  51% Completed | 28/55 [00:32<00:29,  1.11s/it]
Loading safetensors checkpoint shards:  53% Completed | 29/55 [00:33<00:28,  1.11s/it]
Loading safetensors checkpoint shards:  55% Completed | 30/55 [00:34<00:28,  1.12s/it]
Loading safetensors checkpoint shards:  56% Completed | 31/55 [00:35<00:27,  1.13s/it]
Loading safetensors checkpoint shards:  58% Completed | 32/55 [00:36<00:26,  1.15s/it]
Loading safetensors checkpoint shards:  60% Completed | 33/55 [00:37<00:25,  1.14s/it]
Loading safetensors checkpoint shards:  62% Completed | 34/55 [00:38<00:23,  1.12s/it]
Loading safetensors checkpoint shards:  64% Completed | 35/55 [00:39<00:22,  1.11s/it]
Loading safetensors checkpoint shards:  65% Completed | 36/55 [00:41<00:21,  1.14s/it]
Loading safetensors checkpoint shards:  67% Completed | 37/55 [00:42<00:20,  1.15s/it]
Loading safetensors checkpoint shards:  69% Completed | 38/55 [00:43<00:19,  1.13s/it]
Loading safetensors checkpoint shards:  71% Completed | 39/55 [00:44<00:18,  1.13s/it]
Loading safetensors checkpoint shards:  73% Completed | 40/55 [00:45<00:16,  1.11s/it]
Loading safetensors checkpoint shards:  75% Completed | 41/55 [00:46<00:15,  1.10s/it]
Loading safetensors checkpoint shards:  76% Completed | 42/55 [00:47<00:14,  1.10s/it]
Loading safetensors checkpoint shards:  78% Completed | 43/55 [00:48<00:13,  1.10s/it]
Loading safetensors checkpoint shards:  80% Completed | 44/55 [00:49<00:12,  1.11s/it]
Loading safetensors checkpoint shards:  82% Completed | 45/55 [00:51<00:11,  1.13s/it]
Loading safetensors checkpoint shards:  84% Completed | 46/55 [00:52<00:10,  1.11s/it]
Loading safetensors checkpoint shards:  85% Completed | 47/55 [00:53<00:08,  1.12s/it]
Loading safetensors checkpoint shards:  87% Completed | 48/55 [00:54<00:07,  1.13s/it]
Loading safetensors checkpoint shards:  89% Completed | 49/55 [00:55<00:06,  1.14s/it]
Loading safetensors checkpoint shards:  91% Completed | 50/55 [00:56<00:05,  1.12s/it]
Loading safetensors checkpoint shards:  93% Completed | 51/55 [00:57<00:04,  1.13s/it]
Loading safetensors checkpoint shards:  95% Completed | 52/55 [00:59<00:03,  1.14s/it]
Loading safetensors checkpoint shards:  96% Completed | 53/55 [01:00<00:02,  1.15s/it]
Loading safetensors checkpoint shards:  98% Completed | 54/55 [01:01<00:01,  1.15s/it]
(RayWorkerWrapper pid=77774) INFO 07-28 13:56:37 model_runner.py:692] Loading model weights took 28.7795 GB
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:32 utils.py:784] Found nccl from library libnccl.so.2 [repeated 14x across cluster]
(RayWorkerWrapper pid=25062, ip=10.0.128.18) INFO 07-28 13:55:32 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) WARNING 07-28 13:55:33 custom_all_reduce.py:69] Custom allreduce is disabled because this process group spans across nodes. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 model_runner.py:680] Starting to load model deepseek-ai/DeepSeek-Coder-V2-Instruct... [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. [repeated 15x across cluster]
(RayWorkerWrapper pid=77774) INFO 07-28 13:55:33 selector.py:54] Using XFormers backend. [repeated 15x across cluster]
(RayWorkerWrapper pid=77774) Cache shape torch.Size([163840, 64]) [repeated 14x across cluster]
(RayWorkerWrapper pid=77389) INFO 07-28 13:55:35 weight_utils.py:223] Using model weights format ['*.safetensors'] [repeated 14x across cluster]
Loading safetensors checkpoint shards: 100% Completed | 55/55 [01:02<00:00,  1.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 55/55 [01:02<00:00,  1.14s/it]

INFO 07-28 13:56:38 model_runner.py:692] Loading model weights took 28.6821 GB
(RayWorkerWrapper pid=77233) INFO 07-28 13:56:42 model_runner.py:692] Loading model weights took 28.7795 GB [repeated 5x across cluster]
(RayWorkerWrapper pid=24677, ip=10.0.128.18) INFO 07-28 13:57:02 model_runner.py:692] Loading model weights took 28.7795 GB [repeated 2x across cluster]
ERROR 07-28 13:57:06 worker_base.py:382] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 07-28 13:57:06 worker_base.py:382] Traceback (most recent call last):
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 07-28 13:57:06 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
ERROR 07-28 13:57:06 worker_base.py:382]     self.model_runner.profile_run()
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
ERROR 07-28 13:57:06 worker_base.py:382]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-28 13:57:06 worker_base.py:382]     return func(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_or_intermediate_states = model_executable(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 454, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 421, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states, residual = layer(positions, hidden_states,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 379, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     hidden_states = self.mlp(hidden_states)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 139, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     final_hidden_states = self.experts(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return self._call_impl(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-28 13:57:06 worker_base.py:382]     return forward_call(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 250, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     final_hidden_states = self.quant_method.apply(
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 75, in apply
ERROR 07-28 13:57:06 worker_base.py:382]     return self.forward(x, layer.w13_weight, layer.w2_weight,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
ERROR 07-28 13:57:06 worker_base.py:382]     return self._forward_method(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 92, in forward_cuda
ERROR 07-28 13:57:06 worker_base.py:382]     return fused_moe(x,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 613, in fused_moe
ERROR 07-28 13:57:06 worker_base.py:382]     return fused_experts(hidden_states,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 511, in fused_experts
ERROR 07-28 13:57:06 worker_base.py:382]     moe_align_block_size(curr_topk_ids, config['BLOCK_SIZE_M'], E))
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 218, in moe_align_block_size
ERROR 07-28 13:57:06 worker_base.py:382]     ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 34, in wrapper
ERROR 07-28 13:57:06 worker_base.py:382]     return fn(*args, **kwargs)
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 395, in moe_align_block_size
ERROR 07-28 13:57:06 worker_base.py:382]     torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
ERROR 07-28 13:57:06 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 854, in __call__
ERROR 07-28 13:57:06 worker_base.py:382]     return self_._op(*args, **(kwargs or {}))
ERROR 07-28 13:57:06 worker_base.py:382] RuntimeError: CUDA error: invalid argument
ERROR 07-28 13:57:06 worker_base.py:382] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] RuntimeError: CUDA error: invalid argument [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. [repeated 14x across cluster]
(RayWorkerWrapper pid=77774) ERROR 07-28 13:57:06 worker_base.py:382] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. [repeated 14x across cluster]
```
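The traceback ends in the moe_align_block_size custom op of the fused-MoE path, so it may be worth exercising that single kernel on one V100 in isolation, decoupled from Ray and NCCL. A minimal sketch, assuming vLLM 0.5.3's module layout; the 160-expert / top-6 routing values are taken from the DeepSeek-Coder-V2 config, and BLOCK_SIZE_M=64 is only a typical fused_moe tile size:

```python
# Hypothetical single-GPU repro of the op that raises "CUDA error: invalid argument".
# Assumes vLLM 0.5.3; the import path may differ in other versions.
import torch
from vllm.model_executor.layers.fused_moe.fused_moe import moe_align_block_size

num_experts = 160   # routed experts in DeepSeek-Coder-V2 (64 in the Lite model)
top_k = 6           # experts selected per token
num_tokens = 8192   # roughly matches the profile run with --max-model-len 8192

# Random routing decisions, same dtype the fused-MoE path uses internally.
topk_ids = torch.randint(0, num_experts, (num_tokens, top_k),
                         dtype=torch.int32, device="cuda")
sorted_ids, expert_ids, num_tokens_post_pad = moe_align_block_size(
    topk_ids, 64, num_experts)
torch.cuda.synchronize()   # surface any asynchronous launch error right here
print(sorted_ids.shape, expert_ids.shape, int(num_tokens_post_pad))
```

On an sm70 card such as the V100, an "invalid argument" at kernel launch often points to a launch-configuration limit (for example shared memory per block with 160 experts) rather than to the inputs themselves, but that is only a guess until the kernel is tested in isolation.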
onlybeyou commented 2 months ago

I am hitting exactly the same problem, and I also observe that GPU utilization is only about 50%.