THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs
Apache License 2.0
5.18k stars · 429 forks

Running openai_api_server.py reports out of memory #173

Closed: lesrose closed this issue 4 months ago

lesrose commented 4 months ago

System Info / 系統信息

CUDA 12.3, Python 3.11.5, CentOS 7, three Tesla P40 GPUs

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

After installing the dependencies, running python openai_api_server.py fails:

    (chatglm4) root@zhangmen:/data/xinkai_hu/model/GLM-4/basic_demo# python openai_api_server.py
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 06-14 10:55:15 config.py:1086] Casting torch.bfloat16 to torch.float16.
    INFO 06-14 10:55:15 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/data/xinkai_hu/model/glm-4-9b-chat', speculative_config=None, tokenizer='/data/xinkai_hu/model/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/xinkai_hu/model/glm-4-9b-chat)
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    WARNING 06-14 10:55:16 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
    INFO 06-14 10:55:16 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
    INFO 06-14 10:55:16 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
    INFO 06-14 10:55:16 selector.py:32] Using XFormers backend.
    rank0: Traceback (most recent call last):
    rank0:   File "/data/xinkai_hu/model/GLM-4/basic_demo/openai_api_server.py", line 547, in <module>
    rank0:     engine = AsyncLLMEngine.from_engine_args(engine_args)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
    rank0:     engine = cls(
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
    rank0:     self.engine = self._init_engine(*args, **kwargs)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    rank0:     return engine_class(*args, **kwargs)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    rank0:     self.model_executor = executor_class(
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/worker/worker.py", line 118, in load_model
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 164, in load_model
    rank0:     self.model = get_model(
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    rank0:     return loader.load_model(model_config=model_config,
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 222, in load_model
    rank0:     model = _initialize_model(model_config, self.load_config,
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 88, in _initialize_model
    rank0:     return model_class(config=model_config.hf_config,
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 343, in __init__
    rank0:     self.transformer = ChatGLMModel(config, quant_config)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 295, in __init__
    rank0:     self.encoder = GLMTransformer(config, quant_config)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 250, in __init__
    rank0:     [GLMBlock(config, quant_config) for i in range(self.num_layers)])
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 250, in <listcomp>
    rank0:     [GLMBlock(config, quant_config) for i in range(self.num_layers)])
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 192, in __init__
    rank0:     self.mlp = GLMMLP(config, quant_config)
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 144, in __init__
    rank0:     self.dense_4h_to_h = RowParallelLinear(
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 633, in __init__
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 81, in create_weights
    rank0:     weight = Parameter(torch.empty(output_size_per_partition,
    rank0:   File "/home/admin/anaconda3/envs/chatglm4/lib/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
    rank0:     return func(*args, **kwargs)
    rank0: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU

Expected behavior / 期待表现

python trans_cli_demo.py works normally; only python openai_api_server.py fails with the error above.
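For context: the float16 weights of glm-4-9b-chat alone occupy roughly 18 GB, and with tensor_parallel_size=1 vLLM tries to place all of them, plus its KV-cache reservation, on a single 24 GB P40, which leaves almost no headroom. A quick way to check per-card free memory before launching (a minimal sketch using PyTorch's standard torch.cuda.mem_get_info API):

```python
import torch

# Print free vs. total memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} ({name}): {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
```

Anything else already resident on a card shrinks that headroom further.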

lesrose commented 4 months ago

python trans_cli_demo.py runs without problems.

lesrose commented 4 months ago

Is the GPU configuration insufficient, or do I need to adjust the parameters?

    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        tokenizer=MODEL_PATH,
        tensor_parallel_size=1,
        dtype="bfloat16",
        trust_remote_code=True,
        # Fraction of GPU memory to occupy; set an appropriate value for your
        # GPU's memory size. For example, if your GPU has 80 GB and you only
        # want to use 24 GB, set 24/80 = 0.3.
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        worker_use_ray=False,
        engine_use_ray=False,
        disable_log_requests=True,
        max_model_len=MAX_MODEL_LENGTH,
    )
zRzRzRzRzRzRzR commented 4 months ago

Change tensor_parallel_size=1 to 3 to use all three cards.
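For reference, a minimal sketch of the suggested change (MODEL_PATH and MAX_MODEL_LENGTH stand in for the values used above; vLLM 0.4.x API):

```python
from vllm.engine.arg_utils import AsyncEngineArgs

MODEL_PATH = "/data/xinkai_hu/model/glm-4-9b-chat"  # path from the log above
MAX_MODEL_LENGTH = 4096                             # illustrative value

engine_args = AsyncEngineArgs(
    model=MODEL_PATH,
    tokenizer=MODEL_PATH,
    tensor_parallel_size=3,   # shard the weights across the three P40s
    dtype="float16",          # Pascal GPUs lack bfloat16; vLLM casts to float16 anyway
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    worker_use_ray=False,
    engine_use_ray=False,
    disable_log_requests=True,
    max_model_len=MAX_MODEL_LENGTH,
)
```

Note that vLLM requires the model's attention-head count to be divisible by the tensor-parallel size, so if a value of 3 is rejected for this model, an even value such as 2 may be the practical alternative.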

lesrose commented 4 months ago

Is it possible to specify which GPUs are used? @zRzRzRzRzRzRzR

zRzRzRzRzRzRzR commented 4 months ago

Use CUDA_VISIBLE_DEVICES.
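That is, set the standard CUDA_VISIBLE_DEVICES environment variable before CUDA is initialized; the process then sees only the listed GPUs, renumbered from zero. A minimal sketch (the GPU indices 0 and 2 are just examples):

```python
import os

# Must be set before torch/vllm initialize CUDA.
# Physical GPUs 0 and 2 then appear to the process as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

import torch
print(torch.cuda.device_count())  # prints 2 on a machine with three GPUs
```

Equivalently, from the shell: CUDA_VISIBLE_DEVICES=0,2 python openai_api_server.py.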

lesrose commented 4 months ago

Thanks, it works now.