PaddlePaddle / FastDeploy

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud, 📱Mobile, and 📹Edge. Covers 20+ mainstream scenarios across Image, Video, Text, and Audio, with 150+ SOTA models, end-to-end optimization, and multi-platform, multi-framework support.
https://www.paddlepaddle.org.cn/fastdeploy
Apache License 2.0

Unable to reproduce the Stable Diffusion inference speed #1764

Closed tianleiwu closed 1 year ago

tianleiwu commented 1 year ago

Environment

Problem logs and steps to reproduce the issue

The model was downloaded from https://bj.bcebos.com/fastdeploy/models/stable-diffusion/runwayml/stable-diffusion-v1-5.tgz
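(For reference, fetching and unpacking that archive would look roughly like the commands below, mirroring the v1-4 steps shown further down; the extracted directory name is an assumption based on the archive name.)

```bash
# Hypothetical sketch: download and extract the v1-5 model archive,
# following the same pattern as the v1-4 commands later in this issue.
cd /FastDeploy/examples/multimodal/stable_diffusion
wget https://bj.bcebos.com/fastdeploy/models/stable-diffusion/runwayml/stable-diffusion-v1-5.tgz
tar -xvzf stable-diffusion-v1-5.tgz   # yields the stable-diffusion-v1-5/ directory passed to --model_dir
```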

```
/FastDeploy/examples/multimodal/stable_diffusion$ python infer.py --model_dir stable-diffusion-v1-5/ --scheduler "euler_ancestral" --backend paddle --inference_steps 50
[2023-04-04 07:24:35,868] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/vocab.json
[2023-04-04 07:24:35,869] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/merges.txt
[2023-04-04 07:24:35,869] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/added_tokens.json and saved to /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14
[2023-04-04 07:24:36,849] [ WARNING] - file https://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/added_tokens.json not exist
[2023-04-04 07:24:36,849] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/special_tokens_map.json
[2023-04-04 07:24:36,850] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/tokenizer_config.json
[INFO] fastdeploy/runtime/runtime.cc(293)::CreateOrtBackend Runtime initialized with Backend::ORT in Device::GPU.
[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU.
[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU.
Spend 11.65 s to load unet model.
Run the stable diffusion pipeline 1 times to test the performance.
No 0 time cost: 4.535359 s
Mean latency: 4.535359 s, p50 latency: 4.535359 s, p90 latency: 4.535359 s, p95 latency: 4.535359 s.

wget https://bj.bcebos.com/fastdeploy/models/stable-diffusion/CompVis/stable-diffusion-v1-4.tgz
/FastDeploy/examples/multimodal/stable_diffusion$ tar -xvzf stable-diffusion-v1-4.tgz
/FastDeploy/examples/multimodal/stable_diffusion$ python infer.py --model_dir stable-diffusion-v1-4/ --backend paddle --inference_steps 50 --use_fp16 1 --scheduler pndm --benchmark_steps 10
[2023-04-04 07:56:04,939] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/vocab.json
[2023-04-04 07:56:04,939] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/merges.txt
[2023-04-04 07:56:04,939] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/added_tokens.json and saved to /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14
[2023-04-04 07:56:05,907] [ WARNING] - file https://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/added_tokens.json not exist
[2023-04-04 07:56:05,908] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/special_tokens_map.json
[2023-04-04 07:56:05,908] [ INFO] - Already cached /home/turinguser/.paddlenlp/models/openai/clip-vit-large-patch14/tokenizer_config.json
[INFO] fastdeploy/runtime/runtime.cc(293)::CreateOrtBackend Runtime initialized with Backend::ORT in Device::GPU.
[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU.
[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU.
Spend 11.83 s to load unet model.
Run the stable diffusion pipeline 5 times to test the performance.
No 0 time cost: 4.607649 s
No 1 time cost: 4.613580 s
No 2 time cost: 4.601662 s
No 3 time cost: 4.602602 s
No 4 time cost: 4.606175 s
Mean latency: 4.606334 s, p50 latency: 4.606175 s, p90 latency: 4.611208 s, p95 latency: 4.612394 s.
Image saved in fd_astronaut_rides_horse.png!
```

The latency is around 4.5~4.6 seconds, far from the 0.76 seconds reported in the blog post (https://blog.csdn.net/PaddlePaddle/article/details/129426638).

wwbitejotunn commented 1 year ago

Hi, for higher-performance Stable Diffusion (SD) inference, the model needs to be run with the paddle_tensorrt backend, e.g. `python text_to_img_infer.py --model_dir stable-diffusion-v1-5/ --scheduler "euler_ancestral" --backend paddle_tensorrt --use_fp16 True --device gpu`. To reach the best performance, we recommend following the example code under paddlenlp/ppdiffusers/deploy for SD model export and performance testing: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers/deploy
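For context, the suggested command formatted for readability is shown below; the model directory layout is assumed to match the export produced by the ppdiffusers deploy example, and the first paddle_tensorrt run typically spends extra time building TensorRT engines, so warm-up runs matter when benchmarking.

```bash
# The backend suggested in the reply, formatted as a single command
# (flags copied from the reply; the model_dir is assumed to contain an
# export compatible with text_to_img_infer.py).
python text_to_img_infer.py \
    --model_dir stable-diffusion-v1-5/ \
    --scheduler "euler_ancestral" \
    --backend paddle_tensorrt \
    --use_fp16 True \
    --device gpu

# Hypothetical: to reproduce the latency statistics reported in this issue,
# the same --inference_steps / --benchmark_steps values used earlier could be
# added, assuming the script accepts those flags.
```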