PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: What are the environment requirements for enabling block_attn during inference? #8692

Closed · wojiaoshihua closed this 3 months ago

wojiaoshihua commented 3 months ago

Please describe your question

I am deploying a Llama model locally, but when I enable block_attn I cannot run inference; the error message seems to say that flash_attn is unsupported. May I ask what environment you ran this successfully in? Mine is paddlenlp=3.0.0b0.post0 and paddlepaddle-gpu=0.0.0.post112. Checking the paddle library, it seems to require CUDA > 11.4. Is the error below caused by my CUDA version being too old?

```
UnimplementedError: FlashAttention is unsupported, please check the GPU compability and CUDA Version.
  (at ../paddle/phi/kernels/gpu/flash_attn_utils.h:367)
  [operator < block_multihead_attention > error]
Process Process-1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/visual/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/visual/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/paddlejob/workspace/mywork/origin/PaddleNLP/llm/utils/utils.py", line 761, in read_res
    output_tensor = tensor_queue.get(timeout=1)
  File "/root/anaconda3/envs/visual/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/envs/visual/lib/python3.10/site-packages/paddle/incubate/multiprocessing/reductions.py", line 130, in _rebuild_lodtensor_filename
    lodtensor = cls._new_shared_filename(
RuntimeError: (Unavailable) File descriptor /paddle_25505_0_3639094214 open failed, unable in read-write mode
  [Hint: Expected fd != -1, but received fd:-1 == -1:-1.]
  (at ../paddle/fluid/memory/allocation/mmap_allocator.cc:85)
```
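A quick way to confirm which CUDA toolkit the installed wheel was built against is sketched below, assuming a standard paddlepaddle-gpu install (the `.post112` suffix on nightly wheels typically indicates a CUDA 11.2 build, which would be below the 11.4 threshold):

```python
import paddle

# CUDA toolkit version the installed Paddle wheel was compiled against;
# FlashAttention kernels require 11.4 or newer.
print(paddle.version.cuda())    # e.g. "11.2" for a .post112 nightly wheel

# cuDNN version bundled with the build, for completeness.
print(paddle.version.cudnn())
```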

DrownFish19 commented 3 months ago

FlashAttention requires CUDA 11.4+ and a GPU with compute capability sm80 or above; see the link for details.
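To check the GPU-side requirement from Python, a minimal sketch, assuming a Paddle release that exposes `paddle.device.cuda.get_device_capability`:

```python
import paddle

# Compute capability of the current GPU; Paddle's FlashAttention kernels
# require sm80 or above (e.g. the A100 is sm80).
major, minor = paddle.device.cuda.get_device_capability()
print(f"sm{major}{minor}")  # must be >= sm80
```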

wojiaoshihua commented 3 months ago

> FlashAttention requires CUDA 11.4+ and a GPU with compute capability sm80 or above; see the link for details.

Got it, thanks.