PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Qwen inference runs out of GPU memory; how do I set up multi-GPU inference? #8298

Closed: zhaogf01 closed this issue 4 months ago

zhaogf01 commented 4 months ago

Please describe your question

This is my inference code. How can I run inference on multiple GPUs?

from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-7b")
model = AutoModelForCausalLM.from_pretrained("qwen/qwen-7b", dtype="float32")
input_features = tokenizer("hello", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
tokenizer.batch_decode(outputs[0])

w5688414 commented 4 months ago

You can use recompute and flash attention to reduce memory usage. For multi-GPU inference, you need to modify some of the parameters; see:

https://github.com/PaddlePaddle/PaddleNLP/blob/40398976f688799b0aed6b9ed725a1c984ef21f9/examples/benchmark/peft/paddle/benchmark.py#L104

Then run:

python -m paddle.distributed.launch --gpus "0,1,2,3" your_script.py 
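A minimal sketch of what that parameter setup might look like for the Qwen script above, assuming (as in the linked benchmark.py) that the process group is initialized with fleet and that from_pretrained accepts tensor_parallel_degree and tensor_parallel_rank; this is an illustration under those assumptions, not code verified against this exact PaddleNLP revision:

import paddle
from paddle.distributed import fleet
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

# Build a hybrid-parallel group whose model-parallel (mp) degree equals the
# number of launched GPUs, mirroring the linked benchmark script.
tensor_parallel_degree = paddle.distributed.get_world_size()
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": tensor_parallel_degree,
    "pp_degree": 1,
    "sharding_degree": 1,
}
fleet.init(is_collective=True, strategy=strategy)
tensor_parallel_rank = fleet.get_hybrid_communicate_group().get_model_parallel_rank()

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-7b")
model = AutoModelForCausalLM.from_pretrained(
    "qwen/qwen-7b",
    dtype="float16",                                # half precision also cuts memory vs. float32
    tensor_parallel_degree=tensor_parallel_degree,  # shard the weights across GPUs
    tensor_parallel_rank=tensor_parallel_rank,      # this process's shard index
)

input_features = tokenizer("hello", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
print(tokenizer.batch_decode(outputs[0]))

Each process launched by paddle.distributed.launch then loads only its own shard of the weights instead of the full model.
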
zhaogf01 commented 4 months ago

I am running inference on 4 GPUs with tensor_parallel_degree set to 2. How should tensor_parallel_rank be set? When I set it to 0, I get the following error: [error screenshot]

w5688414 commented 4 months ago

You can refer to the following:

https://github.com/PaddlePaddle/PaddleNLP/blob/1ffa290841aa5e1b0d1d074e4f69600481ddb1b2/llm/glm/predict_generation.py#L70
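In other words (a sketch assuming the same fleet-based setup as in that script; not copied from it), tensor_parallel_rank should not be hard-coded to 0: each launched process derives its own model-parallel rank from the hybrid communicate group, and the product of the configured degrees generally needs to match the number of launched GPUs (e.g. 2 GPUs for a pure tensor_parallel_degree=2 setup):

from paddle.distributed import fleet

# Launched e.g. with: python -m paddle.distributed.launch --gpus "0,1" your_script.py
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": 2,        # tensor_parallel_degree
    "pp_degree": 1,
    "sharding_degree": 1,
}
fleet.init(is_collective=True, strategy=strategy)
hcg = fleet.get_hybrid_communicate_group()
tensor_parallel_rank = hcg.get_model_parallel_rank()   # 0 on the first GPU, 1 on the second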

zhaogf01 commented 4 months ago

Got it, thanks.