THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

Cannot use vLLM #274

Closed itbithubman closed 3 weeks ago

itbithubman commented 3 weeks ago

System Info / 系統信息

- OS: Ubuntu 22
- Python: 3.11 (conda)
- nvidia-cudnn-cu12
- torch 2.3.0
- vllm 0.5.0.post1
- vllm-flash-attn 2.5.9
- xformers 0.0.26.post1

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

python3 basic_demo/vllm_cli_demo.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-02 12:10:49 config.py:1222] Casting torch.bfloat16 to torch.float16.
2024-07-02 12:10:52,580 INFO worker.py:1771 -- Started a local Ray instance.
INFO 07-02 12:10:53 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=THUDM/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-02 12:10:54 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-02 12:10:57 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-02 12:10:57 selector.py:51] Using XFormers backend.
INFO 07-02 12:10:58 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-02 12:10:58 selector.py:51] Using XFormers backend.
INFO 07-02 12:11:06 model_runner.py:160] Loading model weights took 17.5635 GB
ERROR 07-02 12:11:07 worker_base.py:148] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 07-02 12:11:07 worker_base.py:148] Traceback (most recent call last):
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
    hidden_states = model_executable(
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 364, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 316, in forward
    hidden_states = self.encoder(
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 272, in forward
    hidden_states = layer(
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 207, in forward
    attention_output = self.self_attention(
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/models/chatglm.py", line 103, in forward
    qkv, _ = self.query_key_value(hidden_states)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 298, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/home/tt/anaconda3/envs/glm4/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 111, in apply
    return F.linear(x, weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Expected behavior / 期待表现

Any use of vLLM fails with this error; the other demos all run normally. I have many packages installed via pip; I listed only some of them above and can add the rest if the list is incomplete.

zRzRzRzRzRzRzR commented 3 weeks ago

`Cannot use FlashAttention-2 backend for Volta and Turing GPUs.` Your GPU is too old. Please use a GPU with the Ampere architecture or newer (RTX 3090 and up); otherwise, don't use FlashAttention-2.
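The cutoff the maintainer describes can be expressed as a check on the GPU's CUDA compute capability. A minimal sketch (the helper name and the hard-coded example capabilities are illustrative, not vLLM's own code):

```python
# FlashAttention-2 requires an NVIDIA GPU with compute capability >= 8.0
# (Ampere: RTX 30xx, A100, ...). Volta (7.0) and Turing (7.5) are rejected,
# which is why vLLM logs "Cannot use FlashAttention-2 backend" above.
AMPERE = (8, 0)

def supports_flash_attn2(capability):
    """Return True if a GPU with this (major, minor) capability can run FA2."""
    return tuple(capability) >= AMPERE

print(supports_flash_attn2((7, 5)))  # Turing (e.g. RTX 2080 / T4) -> False
print(supports_flash_attn2((8, 6)))  # Ampere (e.g. RTX 3090)      -> True
```

On the affected machine, the actual capability can be read with `torch.cuda.get_device_capability()`.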

itbithubman commented 3 weeks ago

It was probably installed automatically. How do I remove it? Is there a command for that?

zRzRzRzRzRzRzR commented 3 weeks ago

It isn't installed by default. Did your setup pull it in itself? Try `pip uninstall flash-attn`.

itbithubman commented 3 weeks ago

I never installed it myself. I see vllm-flash-attn 2.5.9, which vLLM appears to install on its own, but removing it still gives the same error. Is there an alternative to vLLM?

zRzRzRzRzRzRzR commented 3 weeks ago

The OpenAI API demos are all built on vLLM. You can take the composite demo and adapt it into a transformers-based demo yourself.
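A minimal sketch of that suggestion, running the model with plain transformers instead of vLLM (assumptions: the model id `THUDM/glm-4-9b-chat` from the log above; the helper functions and generation settings are illustrative, not the repo's composite demo):

```python
from typing import Dict, List

MODEL_ID = "THUDM/glm-4-9b-chat"  # same checkpoint as in the log above

def build_messages(user_prompt: str) -> List[Dict[str, str]]:
    """Messages in the format expected by tokenizer.apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate one reply with plain transformers (no vLLM involved)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # pre-Ampere GPUs lack fast bfloat16
        trust_remote_code=True,
        device_map="auto",
    ).eval()

    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(build_messages("hello"))  # [{'role': 'user', 'content': 'hello'}]
```

This avoids vLLM's precompiled CUDA kernels entirely, so it should not hit the "no kernel image is available" error on older GPUs, at the cost of lower throughput.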

itbithubman commented 3 weeks ago

OK, thanks. I installed Xinference; flash-attn wasn't installed automatically there due to version issues, but I can use its OpenAI-compatible API directly.