I recently tested https://huggingface.co/THUDM/glm-4-9b-chat. Both vLLM and LMDeploy were run on their latest versions with default startup parameters, and LMDeploy's maximum RPS was again close to 1.8 times that of vLLM. I hope this information is helpful to you.
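For reference, a rough sketch of how such an RPS comparison could be driven against either server's OpenAI-compatible endpoint; the URL, model path, prompt, request count, and concurrency below are illustrative assumptions, not the exact benchmark used here:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"  # vLLM or LMDeploy api_server
MODEL = "/workdir/glm-4-9b-chat"
N_REQUESTS, CONCURRENCY = 256, 32                  # illustrative load settings

def one_request(_):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
    }
    return requests.post(URL, json=payload, timeout=300).status_code

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start
print(f"{statuses.count(200)}/{N_REQUESTS} ok, RPS = {N_REQUESTS / elapsed:.2f}")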
In my opinion, vLLM currently has an advantage over LMDeploy in only one scenario: running MoE models like Mixtral on H100 while also using the FP8 feature. In that case, you may want to try vLLM; otherwise, it is not particularly recommended.
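If you do want to try that scenario, here is a minimal sketch using vLLM's offline LLM API, assuming an H100-class GPU and a vLLM version with FP8 support; the model name and settings are examples only, not configurations from this thread:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example MoE model
    quantization="fp8",       # FP8 weights; needs Hopper-class hardware
    tensor_parallel_size=2,   # adjust to the GPUs available
)
params = SamplingParams(temperature=0.7, max_tokens=128)
print(llm.generate(["Explain CUDA graphs in one sentence."], params)[0].outputs[0].text)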
Many thanks for your reply. It seems the vLLM test was not using CUDA Graph, since https://huggingface.co/THUDM/glm-4-9b-chat uses the parameter enforce_eager=True, and CUDA Graph will bring at least a 2x speedup. I am using vLLM to deploy a Llama-style model without quantization and can't seem to speed it up further. I will try testing with LMDeploy.
and CUDA Graph will bring at least a 2x speedup
Can you share how to enable CUDA Graph when using vLLM inference on GLM 4 9B Chat? I can do some verification locally. Thanks.
It seems that the default setting has enabled CUDA Graph.
[root@hostname workdir]# python3 -m vllm.entrypoints.openai.api_server --model /workdir/glm-4-9b-chat --trust-remote-code
INFO 07-12 12:00:14 api_server.py:206] vLLM API server version 0.5.1
INFO 07-12 12:00:14 api_server.py:207] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/workdir/glm-4-9b-chat', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-12 12:00:14 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/workdir/glm-4-9b-chat', speculative_config=None, tokenizer='/workdir/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/workdir/glm-4-9b-chat, use_v2_block_manager=False, enable_prefix_caching=False)
highlight: enforce_eager=False
@tricky61
and CUDA Graph will bring at least a 2x speedup
Can you share how to enable CUDA Graph when using vLLM inference on GLM 4 9B Chat? I can do some verification locally. Thanks.
Changing enforce_eager=True to enforce_eager=False will enable CUDA Graph. Sometimes I manually set enforce_eager=False in the config.py.
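For completeness, a minimal sketch of the same setting through vLLM's offline LLM API (the model path is illustrative): enforce_eager defaults to False, which allows CUDA graph capture for decoding, while True forces eager mode.

from vllm import LLM, SamplingParams

llm = LLM(
    model="/workdir/glm-4-9b-chat",
    trust_remote_code=True,
    enforce_eager=False,  # default: allow CUDA graph capture
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)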
Hi @tricky61, the default is False.
It seems that the default setting has enabled CUDA Graph.
highlight: enforce_eager=False
@tricky61
It seems CUDA Graph was used.
The log has these two lines:
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
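If that capture step does run out of memory, the warning above names the relevant knobs; a hedged sketch with illustrative values (not settings taken from this thread):

from vllm import LLM

llm = LLM(
    model="/workdir/glm-4-9b-chat",
    trust_remote_code=True,
    gpu_memory_utilization=0.85,  # leave headroom for the 1~3 GiB graph buffers
    max_num_seqs=128,             # a lower batch cap also reduces memory
    # enforce_eager=True,         # last resort: skip CUDA graphs entirely
)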
Is LMDeploy faster than vLLM only with the TurboMind engine, or is the PyTorch engine faster too?
The log has these two lines
Yes, it does.
Is LMDeploy faster than vLLM only with the TurboMind engine, or is the PyTorch engine faster too?
Compared with vLLM, LMDeploy's PyTorch engine also has an advantage, just not as pronounced as TurboMind's. You can test it out in practice.
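If it helps, a hedged sketch of selecting between the two LMDeploy backends with the pipeline API; the model path and config values are illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind backend (typically the faster one)
pipe = pipeline("/workdir/glm-4-9b-chat",
                backend_config=TurbomindEngineConfig(tp=1, session_len=8192))

# PyTorch backend, for comparison:
# pipe = pipeline("/workdir/glm-4-9b-chat",
#                 backend_config=PytorchEngineConfig(tp=1, session_len=8192))

print(pipe(["Say hello in one sentence."])[0].text)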
vLLM adds support for new models quickly, but its scheduling overhead is relatively high and difficult to optimize. You can refer to SGLang, which partially reuses vLLM for model support, takes its scheduling design from LightLLM, and uses the faster FlashInfer for the attention part.
Many thanks for your suggestion.
Which version of vLLM do you use? And does your vLLM deployment use CUDA Graph?