THUDM / CodeGeeX4

CodeGeeX4-ALL-9B, a versatile model for all AI software development scenarios, including code completion, code interpreter, web search, function calling, repository-level Q&A and much more.
https://codegeex.cn
Apache License 2.0

vllm loads the model but never starts inference, and GPU usage stays at 100%. What is going on? #23

Closed luguoyixiazi closed 1 month ago

luguoyixiazi commented 1 month ago

The code is as follows:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

vllm output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-11 15:51:44 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/models/codegeex4-all-9b', speculative_config=None, tokenizer='/models/codegeex4-all-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/models/codegeex4-all-9b, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-11 15:51:44 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 07-11 15:51:44 utils.py:562] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 07-11 15:55:24 model_runner.py:255] Loading model weights took 17.5635 GB

It uses 40 GB of VRAM, and the same thing happens with the xformers backend.

luguoyixiazi commented 1 month ago

As expected of a 131072 context length, it blew up the KV cache hard... emmm... looks like I have to fall back to CPU inference.
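For context, here is a rough back-of-the-envelope estimate of the per-token KV cache size. The layer/head numbers below are assumptions based on the GLM-4-9B-style config that codegeex4-all-9b appears to use; read the real values from the model's config.json before trusting the result.

# Hypothetical sketch: estimate KV cache size per token and per full-length sequence.
# All architecture numbers below are assumptions, not confirmed values.
num_layers = 40          # assumed number of transformer layers
num_kv_heads = 2         # assumed multi-query / GQA KV heads
head_dim = 128           # assumed per-head dimension
dtype_bytes = 2          # bfloat16
max_model_len = 131072

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
print(f"{bytes_per_token / 1024:.0f} KiB per token")                                     # ~40 KiB
print(f"{bytes_per_token * max_model_len / 1024**3:.1f} GiB per full-length sequence")   # ~5.0 GiB

On top of the per-sequence cost, vLLM pre-allocates KV-cache blocks up to its gpu_memory_utilization fraction of the card (0.9 by default), which is likely why the process grabs ~40 GB before serving a single request.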

hongshixian commented 1 month ago

(quoting the reproduction code and vllm output from the original post above)

I ran into the same problem. Is it caused by max_model_len? How did you solve it?

luguoyixiazi commented 1 month ago

Bro, I already closed this... emmm. I set a breakpoint and it turns out most of the time is spent building the KV cache; max len is set so large that a huge cache has to be allocated. Just turn max_model_len down and it works.
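For anyone landing here later, a minimal sketch of that fix: shrink max_model_len (the 32768 below is just an assumed value, pick whatever your workload actually needs) and, optionally, cap gpu_memory_utilization so vLLM pre-allocates less of the card.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Assumption: 32768 tokens of context are enough; lower it further if memory is still tight.
max_model_len, tp_size = 32768, 1
model_name = "/models/codegeex4-all-9b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,        # smaller context => far less KV cache to build
    gpu_memory_utilization=0.8,         # optional: cap how much VRAM vLLM pre-allocates
    trust_remote_code=True,
    enforce_eager=True,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

outputs = llm.generate(["write a quicksort in python"], sampling_params)
print(outputs[0].outputs[0].text)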

hongshixian commented 1 month ago

Bro, I already closed this... emmm. I set a breakpoint and it turns out most of the time is spent building the KV cache; max len is set so large that a huge cache has to be allocated. Just turn max_model_len down and it works.

Thanks a lot!