Closed luguoyixiazi closed 1 month ago
A 131072 context length sure lives up to its name — it completely blew up the KV cache... emmm... looks like I have no choice but CPU inference now.
The code:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
vLLM output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-11 15:51:44 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/models/codegeex4-all-9b', speculative_config=None, tokenizer='/models/codegeex4-all-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/models/codegeex4-all-9b, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-11 15:51:44 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 07-11 15:51:44 utils.py:562] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 07-11 15:55:24 model_runner.py:255] Loading model weights took 17.5635 GB
It uses 40 GB of VRAM; the same happens with the xformers backend.
I ran into the same problem. Is it caused by max_model_len? How did you solve it?
Bro, I closed it... emmm, I stepped through with a breakpoint, and it's mostly spent building the KV cache. max_model_len was set so large that the cache has to be huge too; just lower max_model_len and it works.
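The back-of-envelope math behind that explanation can be sketched as follows. This is only an illustrative estimate: the layer count, KV-head count, and head dimension below are assumed placeholder values, not read from the codegeex4-all-9b config, so check the model's config.json for the real numbers.

```python
# Rough KV-cache sizing sketch. All architecture numbers here are
# ASSUMPTIONS for illustration, not the actual codegeex4-all-9b config.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2 covers both the K and the V tensor; bf16 = 2 bytes/element.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical GQA configuration for a ~9B model.
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=2, head_dim=128)

# Cache size scales linearly with the context length, which is why
# shrinking max_model_len shrinks the preallocated cache.
full_ctx_gib = per_token * 131072 / 1024**3
small_ctx_gib = per_token * 8192 / 1024**3
print(f"{per_token} bytes/token; "
      f"{full_ctx_gib:.2f} GiB at 131072 tokens vs {small_ctx_gib:.2f} GiB at 8192")
```

Note that vLLM also preallocates cache blocks up to its gpu_memory_utilization budget on top of the model weights, so the observed 40 GB is not the cache alone.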
Thanks a lot!