PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Qwen inference runs out of GPU memory; how do I set up multi-GPU inference? #8298

Closed: zhaogf01 closed this issue 4 months ago

zhaogf01 commented 4 months ago

Please describe your question

This is my inference code. How can I run inference on multiple GPUs?

from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-7b")
model = AutoModelForCausalLM.from_pretrained("qwen/qwen-7b", dtype="float32")
input_features = tokenizer("hello", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
tokenizer.batch_decode(outputs[0])

w5688414 commented 4 months ago

You can use recompute and flash attention to reduce memory usage. For multi-GPU inference, you need to modify some of the parameters; see:

https://github.com/PaddlePaddle/PaddleNLP/blob/40398976f688799b0aed6b9ed725a1c984ef21f9/examples/benchmark/peft/paddle/benchmark.py#L104

Then run:

python -m paddle.distributed.launch --gpus "0,1,2,3" your_script.py 
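A minimal sketch of what that parameter setup might look like for the Qwen script above, assuming (as in the linked benchmark.py) that the process group is initialized with fleet and that from_pretrained accepts tensor_parallel_degree and tensor_parallel_rank; this is an illustration under those assumptions, not code verified against this exact PaddleNLP revision:

import paddle
from paddle.distributed import fleet
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

# Build a hybrid-parallel group whose model-parallel (mp) degree equals the
# number of launched GPUs, mirroring the linked benchmark script.
tensor_parallel_degree = paddle.distributed.get_world_size()
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": tensor_parallel_degree,
    "pp_degree": 1,
    "sharding_degree": 1,
}
fleet.init(is_collective=True, strategy=strategy)
tensor_parallel_rank = fleet.get_hybrid_communicate_group().get_model_parallel_rank()

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-7b")
model = AutoModelForCausalLM.from_pretrained(
    "qwen/qwen-7b",
    dtype="float16",                                # half precision also cuts memory vs. float32
    tensor_parallel_degree=tensor_parallel_degree,  # shard the weights across GPUs
    tensor_parallel_rank=tensor_parallel_rank,      # this process's shard index
)

input_features = tokenizer("hello", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
print(tokenizer.batch_decode(outputs[0]))

Each process launched by paddle.distributed.launch then loads only its own shard of the weights instead of the full model.
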
zhaogf01 commented 4 months ago

I am running inference on 4 GPUs with tensor_parallel_degree set to 2. How should tensor_parallel_rank be set? When I set it to 0, I get the following error: [error screenshot]

w5688414 commented 4 months ago

You can refer to the following:

https://github.com/PaddlePaddle/PaddleNLP/blob/1ffa290841aa5e1b0d1d074e4f69600481ddb1b2/llm/glm/predict_generation.py#L70
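In other words (a sketch assuming the same fleet-based setup as in that script; not copied from it), tensor_parallel_rank should not be hard-coded to 0: each launched process derives its own model-parallel rank from the hybrid communicate group, and the product of the configured degrees generally needs to match the number of launched GPUs (e.g. 2 GPUs for a pure tensor_parallel_degree=2 setup):

from paddle.distributed import fleet

# Launched e.g. with: python -m paddle.distributed.launch --gpus "0,1" your_script.py
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": 2,        # tensor_parallel_degree
    "pp_degree": 1,
    "sharding_degree": 1,
}
fleet.init(is_collective=True, strategy=strategy)
hcg = fleet.get_hybrid_communicate_group()
tensor_parallel_rank = hcg.get_model_parallel_rank()   # 0 on the first GPU, 1 on the second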

zhaogf01 commented 4 months ago

Got it, thanks.