I recently tested https://huggingface.co/THUDM/glm-4-9b-chat. Both vLLM and LMDeploy were run on their latest versions with default startup parameters, and LMDeploy's maximum RPS was again close to 1.8 times that of vLLM. I hope this information is helpful to you.
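For reference, a rough sketch of how such an RPS comparison could be driven against either server's OpenAI-compatible endpoint; the URL, model path, prompt, request count, and concurrency below are illustrative assumptions, not the exact benchmark used here:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"  # vLLM or LMDeploy api_server
MODEL = "/workdir/glm-4-9b-chat"
N_REQUESTS, CONCURRENCY = 256, 32                  # illustrative load settings

def one_request(_):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
    }
    return requests.post(URL, json=payload, timeout=300).status_code

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start
print(f"{statuses.count(200)}/{N_REQUESTS} ok, RPS = {N_REQUESTS / elapsed:.2f}")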
In my opinion, vLLM currently has an advantage over LMDeploy in only one scenario: running MoE models like Mixtral on H100 while also using the FP8 feature. In that case, you may want to try vLLM; otherwise, it is not particularly recommended.
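If you do want to try that scenario, here is a minimal sketch using vLLM's offline LLM API, assuming an H100-class GPU and a vLLM version with FP8 support; the model name and settings are examples only, not configurations from this thread:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    quantization="fp8",       # FP8 weights; needs Hopper-class hardware
    tensor_parallel_size=2,   # adjust to the GPUs available
)
params = SamplingParams(temperature=0.7, max_tokens=128)
print(llm.generate(["Explain CUDA graphs in one sentence."], params)[0].outputs[0].text)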
Many thanks for your reply. It seems the vLLM test was not using CUDA Graph, since https://huggingface.co/THUDM/glm-4-9b-chat uses the parameter enforce_eager=True, and CUDA Graph will bring at least a 2x speedup. I am using vLLM to deploy a Llama-style model without quantization and can't seem to speed it up further. I will try testing with LMDeploy.
and CUDA Graph will bring at least a 2x speedup
Can you share how to enable CUDA Graph when using vLLM inference on GLM 4 9B Chat? I can do some verification locally. Thanks.
It seems that the default setting has enabled CUDA Graph.
[root@hostname workdir]# python3 -m vllm.entrypoints.openai.api_server --model /workdir/glm-4-9b-chat --trust-remote-code
INFO 07-12 12:00:14 api_server.py:206] vLLM API server version 0.5.1
INFO 07-12 12:00:14 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/glm-4-9b-chat', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-12 12:00:14 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/glm-4-9b-chat', speculative_config=None, tokenizer='/workdir/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/glm-4-9b-chat, use_v2_block_manager=False, enable_prefix_caching=False)
highlight: enforce_eager=False
@tricky61
and CUDA Graph will bring at least a 2x speedup
Can you share how to enable CUDA Graph when using vLLM inference on GLM 4 9B Chat? I can do some verification locally. Thanks.
Changing enforce_eager=True to enforce_eager=False will enable CUDA Graph. Sometimes I manually set enforce_eager=False in the config.py.
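For completeness, a minimal sketch of the same setting through vLLM's offline LLM API (the model path is illustrative): enforce_eager defaults to False, which allows CUDA graph capture for decoding, while True forces eager mode.

from vllm import LLM, SamplingParams

llm = LLM(
    model="/workdir/glm-4-9b-chat",
    trust_remote_code=True,
    enforce_eager=False,  # default: allow CUDA graph capture
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)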
Hi @tricky61, the default is False.
It seems that the default setting has enabled CUDA Graph.
highlight: enforce_eager=False
@tricky61
It seems CUDA Graph was used.
The log has these two lines:
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
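If that capture step does run out of memory, the warning above names the relevant knobs; a hedged sketch with illustrative values (not settings taken from this thread):

from vllm import LLM

llm = LLM(
    model="/workdir/glm-4-9b-chat",
    trust_remote_code=True,
    gpu_memory_utilization=0.85,  # leave headroom for the 1~3 GiB graph buffers
    max_num_seqs=128,             # a lower batch cap also reduces memory
    # enforce_eager=True,         # last resort: skip CUDA graphs entirely
)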
Is LMDeploy faster than vLLM only with the TurboMind engine, or is the PyTorch engine faster too?
The log has these two lines
Yes, it does.
Is LMDeploy faster than vLLM only with the TurboMind engine, or is the PyTorch engine faster too?
Compared with vLLM, LMDeploy's PyTorch engine also has an advantage, just not as pronounced as TurboMind's. You can test it out in practice.
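If it helps, a hedged sketch of selecting between the two LMDeploy backends with the pipeline API; the model path and config values are illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind backend (typically the faster one)
pipe = pipeline("/workdir/glm-4-9b-chat",
                backend_config=TurbomindEngineConfig(tp=1, session_len=8192))

# PyTorch backend, for comparison:
# pipe = pipeline("/workdir/glm-4-9b-chat",
#                 backend_config=PytorchEngineConfig(tp=1, session_len=8192))

print(pipe(["Say hello in one sentence."])[0].text)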
vLLM adds support for new models quickly, but its scheduling overhead is relatively high and difficult to optimize. You can refer to SGLang, which partially reuses vLLM for model support, takes its scheduling design from LightLLM, and uses the faster FlashInfer for the attention part.
Many thanks for your suggestion.
Which version of vLLM do you use? And does your vLLM deployment use CUDA Graph?