Open ucas010 opened 2 weeks ago
Thanks.
Hello, this bug can also be reproduced with a simple script:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
Could you tell me what is going on here?
The V100 does not support bf16.
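The bf16 limit can be checked before launching the engine. The following is a minimal sketch, assuming PyTorch is installed. `pick_dtype` and `detect_dtype` are hypothetical helper names, and the 8.0 threshold is taken from the vLLM error message quoted in this issue.

```python
from typing import Tuple


def pick_dtype(capability: Tuple[int, int]) -> str:
    """Return a dtype name that fits the given CUDA compute capability.

    The 8.0 cutoff mirrors the vLLM error message in this issue:
    bfloat16 needs compute capability >= 8.0.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


def detect_dtype(device: int = 0) -> str:
    """Query the local GPU. Requires PyTorch and a visible CUDA device."""
    import torch  # imported lazily so pick_dtype works without PyTorch

    return pick_dtype(torch.cuda.get_device_capability(device))


if __name__ == "__main__":
    print(pick_dtype((7, 0)))  # V100 (compute capability 7.0) -> float16
    print(pick_dtype((8, 0)))  # compute capability 8.0 -> bfloat16
```

The returned string could then be passed as the `dtype` argument of `LLM(...)`, or as `--dtype` on the command line.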
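Is there any bf8 support?
After changing dtype to float16, I get: AttributeError: '_OpNamespace' '_C' object has no attribute 'rms_norm'
Please check the README and set up the environment according to the requirements listed there. Inference with fp16 works but is not recommended, since it may cause minor issues; bf16 is the better choice.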
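One possible cause of the `rms_norm` error, judging from the startup log in this issue, is that the compiled `vllm._C` extension failed to import: it looks for `libcudart.so.12`, while the reported environment has CUDA 11.7. The sketch below spots that kind of mismatch from the ImportError text. `required_cuda_major` and `cuda_mismatch` are hypothetical helpers, not part of vLLM.

```python
import re
from typing import Optional


def required_cuda_major(import_error_text: str) -> Optional[int]:
    """Extract the CUDA runtime major version named in an ImportError message."""
    match = re.search(r"libcudart\.so\.(\d+)", import_error_text)
    return int(match.group(1)) if match else None


def cuda_mismatch(import_error_text: str, installed_cuda: str) -> bool:
    """Return True when the missing libcudart major version differs from the installed one."""
    needed = required_cuda_major(import_error_text)
    if needed is None:
        return False  # the message does not mention libcudart at all
    installed_major = int(installed_cuda.split(".")[0])
    return needed != installed_major


if __name__ == "__main__":
    message = (
        "libcudart.so.12: cannot open shared object file: "
        "No such file or directory"
    )
    print(required_cuda_major(message))    # 12
    print(cuda_mismatch(message, "11.7"))  # True
```

If the check reports a mismatch, the installed vLLM build and the local CUDA runtime probably do not match, which is consistent with the advice to reinstall from the requirements file.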
System Info
CUDA 11.7, Python 3.10.12, GPU: V100 with 32 GB of memory. vllm 0.5.4, vllm-flash-attn 2.6.1. Everything else was installed according to the requirements in basic_demo.
Who can help?
@wwewwt @Sengxian @davidlvxin @codazzy @
Information
Reproduction
python openai_api_server.py
WARNING 08-27 13:59:59 _custom_ops.py:15] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
INFO 08-27 14:00:04 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', speculative_config=None, tokenizer='/data/ChatGLM-6B/conf/models/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/ChatGLM-6B/conf/models/glm-4-9b-chat, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-27 14:00:05 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-27 14:00:05 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-27 14:00:05 selector.py:54] Using XFormers backend.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.4.0)
    Python 3.10.14 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/data/soft/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/t/anaconda3/envs/langchain/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
Traceback (most recent call last):
  File "/data/GLM-4/basic_demo/openai_api_server.py", line 683, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/data//envs/langchain/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/data//anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/data//anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/data/t/anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/data//anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/data//anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/data/st/anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/worker/worker.py", line 125, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/data/soft/anaconda3/envs/langchain/lib/python3.10/site-packages/vllm/worker/worker.py", line 358, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
Expected behavior
I hope this bug can be fixed. Thank you very much.