Firefly RK3588: RKLLM CPU usage issue

airockchip / rknn-llm

Firefly RK3588: RKLLM CPU usage issue #27

Open Caical opened 7 months ago

Caical commented 7 months ago

I'm running the Qwen 1.8B model on a Firefly RK3588. CPU usage is very high and the Q&A speed is also a bit slow. Is this normal?

fydeos-alex commented 7 months ago

If your board has more than 4 CPU cores, 109% means the model uses about one core, which is acceptable. It will use more if you run on 3 NPU cores. I guess the data copy between the CPU and NPU causes this cost. RK has a zero-copy API for RKNN-Toolkit2, but not for RKLLM.
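To put the 109% figure in context: by default `top` reports a process's %CPU relative to a single core, so 100% is one fully busy core. A rough sketch of the conversion to a share of the whole machine follows. The helper name `cpu_share` is made up for illustration, and the 8-core count assumes the RK3588's 4x Cortex-A76 + 4x Cortex-A55 layout.

```shell
#!/bin/sh
# cpu_share is a hypothetical helper, not part of any Rockchip tool.
# It converts a per-process %CPU reading (100% = one core, as top shows it)
# into a percentage of total CPU capacity.
cpu_share() {
    pct="$1"                # %CPU value read from top, e.g. 109
    cores="${2:-$(nproc)}"  # core count; defaults to this machine's online cores
    awk -v p="$pct" -v c="$cores" 'BEGIN { printf "%.0f\n", p / c }'
}

# 109% on an 8-core board is about 14% of the machine.
cpu_share 109 8
```

So a reading slightly above 100% means one core is saturated, not that the whole SoC is.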

Caical commented 7 months ago

But I see that boards from other manufacturers with the same configuration answer very quickly, and their CPU usage is only 50%.

fydeos-alex commented 7 months ago

I'm not familiar with the other board configurations, but your usage is almost the same as mine. Could you please share more information about the faster examples, so I can help you better?

Caical commented 7 months ago

After setting my CPU and NPU to a fixed frequency, the speed improved significantly and the CPU usage was normal.

fydeos-alex commented 7 months ago

That was awesome! Would you mind sharing with me your setting methods? I'd really appreciate it.

Caical commented 7 months ago

My board is a Firefly ROC-RK3588-PC, and the settings (applied as root) are as follows.

CPU:

echo performance | tee /sys/bus/cpu/devices/cpu*/cpufreq/scaling_governor

NPU:

echo performance > /sys/class/devfreq/fdab0000.npu/governor

I am using three NPU cores.
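The two commands above can be wrapped into a small script that also reads the governors back, which helps confirm the setting actually took effect. This is only a sketch: `SYSFS_ROOT`, `NPU_DEV` and the function names are illustrative, the NPU node name `fdab0000.npu` is the one from this comment and may differ on other boards or kernels, and writing the governors needs root.

```shell
#!/bin/sh
# SYSFS_ROOT and NPU_DEV are illustrative variables, not part of any Rockchip tool.
# SYSFS_ROOT is overridable only so the functions can be tried on a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
NPU_DEV="${NPU_DEV:-fdab0000.npu}"   # devfreq node name taken from the comment above

# Print the path of every CPU and NPU governor file that exists, one per line.
governor_files() {
    for f in "$SYSFS_ROOT"/bus/cpu/devices/cpu[0-9]*/cpufreq/scaling_governor \
             "$SYSFS_ROOT/class/devfreq/$NPU_DEV/governor"; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Write the given governor name (e.g. performance) to every file found.
set_governors() {
    governor_files | while IFS= read -r f; do
        echo "$1" > "$f"
    done
}

# Show the current governor of each file.
show_governors() {
    governor_files | while IFS= read -r f; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# Read-only check of the current state; run `set_governors performance` as root to change it.
show_governors
```

These settings do not survive a reboot, so they would need to be reapplied at startup.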

Caical commented 6 months ago

When I only changed the NPU governor to userspace, the Q&A speed did not improve. But when I changed the CPU governor to userspace and raised the clock frequency, performance improved to 21 tokens/s. Why is the bottleneck of RKLLM on the CPU?
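For anyone who wants to try the CPU part of this, here is a minimal sketch of pinning each cpufreq policy to its highest listed frequency with the userspace governor. It assumes the kernel was built with the userspace governor and that the driver exposes `scaling_available_frequencies`. `CPUFREQ_DIR` and the function names are illustrative, and the writes need root.

```shell
#!/bin/sh
# CPUFREQ_DIR is an illustrative variable; the default is the standard cpufreq policy directory.
CPUFREQ_DIR="${CPUFREQ_DIR:-/sys/devices/system/cpu/cpufreq}"

# Print the highest frequency (in kHz) that a policy directory advertises.
max_freq_khz() {
    tr ' ' '\n' < "$1/scaling_available_frequencies" | sort -n | tail -n 1
}

# Switch one policy to the userspace governor and pin it to its highest frequency.
pin_policy_max() {
    khz="$(max_freq_khz "$1")"
    echo userspace > "$1/scaling_governor"
    echo "$khz" > "$1/scaling_setspeed"
}

# Apply the pin to every policy directory that is writable (each CPU cluster has its own).
pin_all_policies() {
    for p in "$CPUFREQ_DIR"/policy*; do
        [ -w "$p/scaling_governor" ] && pin_policy_max "$p"
    done
    return 0
}
```

Calling `pin_all_policies` as root applies it to every cluster. Reading `scaling_cur_freq` in each policy directory afterwards shows whether the frequency actually stuck.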

fydeos-alex commented 6 months ago

Hi there. As I said before:

the data copy between the CPU and NPU causes this cost. RK has a zero-copy API for RKNN-Toolkit2, but not for RKLLM.

I think the RKNPU user guide will help you understand how RK uses its NPU and CPU: https://github.com/airockchip/rknn-toolkit2/blob/master/doc/02_Rockchip_RKNPU_User_Guide_RKNN_SDK_V2.0.0beta0_EN.pdf

yotofu commented 5 months ago

Why does RKLLM have no zero-copy API? Is this feature planned for a future version?