Open Caical opened 7 months ago
If your board has more than 4 CPU cores, 109% means the model uses about one core's worth of CPU, which is acceptable. It will cost more if you use all 3 NPU cores. My guess is that copying data between the CPU and NPU causes this overhead. Rockchip provides a zero-copy API in RKNN-Toolkit2, but not in RKLLM.
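To make the 109% figure concrete, here is a small sketch that converts a top-style %CPU value (where 100% equals one fully busy core) into a core count and a share of the whole CPU. The helper name `cpu_share` is made up for illustration; the core count defaults to whatever `nproc` reports.

```shell
#!/bin/sh
# cpu_share PERCENT [NCORES]
# Interpret a per-process %CPU figure in which 100% means one fully busy core.
# NCORES defaults to the number of cores reported by nproc.
cpu_share() {
    pct="$1"
    ncores="${2:-$(nproc)}"
    awk -v p="$pct" -v n="$ncores" 'BEGIN {
        printf "%.2f cores busy, %.1f%% of %d cores\n", p / 100, p / n, n
    }'
}
```

On an 8-core RK3588, `cpu_share 109 8` works out to about 1.09 cores, or roughly 14% of the total CPU capacity.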
But I have seen boards from other manufacturers with the same configuration where the Q&A speed is very fast and the CPU usage is only 50%.
I'm not familiar with the configuration of those other boards, but your usage looks almost the same as mine. Could you share more information about the faster examples, so I can help you better?
After setting my CPU and NPU to a fixed frequency, the speed improved significantly and the CPU usage returned to normal.
That was awesome! Would you mind sharing with me your setting methods? I'd really appreciate it.
My board is a Firefly ROC-RK3588-PC, and the settings are as follows.
CPU:
echo performance | tee $(ls /sys/bus/cpu/devices/cpu*/cpufreq/scaling_governor)
NPU:
echo performance > /sys/class/devfreq/fdab0000.npu/governor
I am also using all three NPU cores.
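For anyone who wants to apply the same settings and confirm they took effect, here is a minimal sketch that writes the governor and reads it back. It assumes the same sysfs paths as the commands above (the NPU node name `fdab0000.npu` may differ on other boards) and must run as root. The function names and the `SYSFS_ROOT` variable are made up for illustration; `SYSFS_ROOT` only exists so the script can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch: pin the CPU and NPU governors to "performance" and read them back.
# SYSFS_ROOT is a hypothetical override for dry runs; leave it unset on a real board.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# set_governor NAME FILE...
# Write governor NAME into each governor file and print the value read back.
set_governor() {
    gov="$1"; shift
    for f in "$@"; do
        # Skip files that are missing or not writable (e.g. not running as root).
        [ -w "$f" ] || { echo "skip (not writable): $f" >&2; continue; }
        echo "$gov" > "$f"
        echo "$f -> $(cat "$f")"
    done
}

pin_performance() {
    set_governor performance "$SYSFS_ROOT"/bus/cpu/devices/cpu*/cpufreq/scaling_governor
    set_governor performance "$SYSFS_ROOT"/class/devfreq/fdab0000.npu/governor
}
```

Calling `pin_performance` as root prints one line per governor file, so a path that does not exist on your board shows up as a "skip" message instead of failing silently. The setting does not survive a reboot.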
When I only changed the NPU governor to userspace, the Q&A speed did not improve. But when I changed the CPU governor to userspace and raised the clock frequency, the performance improved to 21 tokens/s. Why is the bottleneck of rknn-llm on the CPU?
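The exact commands for the userspace setup were not posted, so the following is only a sketch of one way to do it with the standard Linux cpufreq sysfs files. It assumes the kernel has the userspace governor enabled and that each policy exposes `scaling_available_frequencies`; it picks the highest listed frequency, which is an assumption and not necessarily the value used above. The function names and `SYSFS_ROOT` are made up for illustration.

```shell
#!/bin/sh
# Sketch: switch every cpufreq policy to the userspace governor and fix it
# at the highest frequency it advertises. Run as root on a real board.
# SYSFS_ROOT is a hypothetical override for dry runs; leave it unset normally.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# max_freq FILE: print the largest value (kHz) in a space-separated list.
max_freq() {
    tr -s ' ' '\n' < "$1" | grep -E '^[0-9]+$' | sort -n | tail -n 1
}

fix_cpu_freq() {
    for p in "$SYSFS_ROOT"/devices/system/cpu/cpufreq/policy*; do
        [ -d "$p" ] || continue
        freq="$(max_freq "$p/scaling_available_frequencies")"
        echo userspace > "$p/scaling_governor"
        # scaling_setspeed only accepts writes while the governor is userspace.
        echo "$freq" > "$p/scaling_setspeed"
        echo "$(basename "$p"): userspace @ ${freq} kHz"
    done
}
```

Running `fix_cpu_freq` handles each cluster separately, which matters on the RK3588 because its little and big cores sit in different policies with different frequency tables.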
Hi, there. Just as I said before:
the data copy between the CPU and NPU causes this cost. Rockchip provides a zero-copy API in RKNN-Toolkit2, but not in RKLLM.
You can check the RKNPU user guide: https://github.com/airockchip/rknn-toolkit2/blob/master/doc/02_Rockchip_RKNPU_User_Guide_RKNN_SDK_V2.0.0beta0_EN.pdf
I think it will help you understand how Rockchip uses its NPU and CPU.
Why does RKLLM have no zero-copy API? Is this feature planned for a future version?
I'm running the qwen1.8b model on a Firefly RK3588. The CPU usage is extremely high and the Q&A speed is also somewhat slow. Is this normal?