0312birdzhang opened 5 months ago
> Which model did you infer with? Are you using all the NPU cores?
supercombo @master 341b81c0a5c5f378f749a8ba07514f48f0f9faa5. The latest model is taking me 100ms to run. Haven't had time to check the older models. Accuracy is bad; not sure if it's my implementation.
> Which model did you infer with? Are you using all the NPU cores?
https://github.com/commaai/openpilot/commit/531c13f2acd8d15914aba34a40b356dcb7f02ced
Yes, with RKNNLite.NPU_CORE_0_1_2
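For reference, a minimal sketch of what that multi-core setup looks like (assuming an already converted supercombo.rknn; the input shape and preprocessing here are placeholders, the real supercombo takes several image and state tensors):

```python
import numpy as np
from rknnlite.api import RKNNLite

rknn_lite = RKNNLite()
rknn_lite.load_rknn("supercombo.rknn")
# Spread inference across all three RK3588 NPU cores.
rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)

dummy_input = np.zeros((1, 12, 128, 256), dtype=np.float32)  # placeholder
outputs = rknn_lite.inference(inputs=[dummy_input])
```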
> supercombo @master 341b81c0a5c5f378f749a8ba07514f48f0f9faa5. The latest model is taking me 100ms to run. Haven't had time to check the older models. Accuracy is bad; not sure if it's my implementation.
The latest model is 100ms for me too. I'm thinking of using the C++ rknn_api, but I'm a Java/Python developer, so it's hard for me.
I am already running with the C++ rknn_api; it's a 10ms saving on the latest model compared to Python, which isn't worth it. I think we need to look at the underlying issue and get the performance of each op type.
I just did a quick test with the older models. The v0.8.13 model runs at 40ms single core, the lemon pie model runs at 70ms single core, and the latest model, certified herbalist, runs at 110ms single core.
I have also tested it, but the NPU usage is very low, and it takes about 50ms to infer one frame.
Also what dev platform are you running on? OS? Are you able to do a perf_debug?
> Also what dev platform are you running on? OS? Are you able to do a perf_debug?
Same as this project: https://github.com/Joshua-Riek/ubuntu-rockchip
The perf log: https://gist.github.com/0312birdzhang/810b1fd40427484f7167e9bd4bf4f1cf
Oh thank you. Looks like Reshape is taking the bulk of the time, moving things around from the NPU to the CPU. Maybe we can remove Reshape and find a substitute for it.
By the way, how are you getting the perf log? I don't quite understand how adb works between a Linux host and a Linux client.
> By the way, how are you getting the perf log?
Search for rknn.eval_perf() in 01_Rockchip_RKNPU_Quick_Start_RKNN_SDK_V1.6.0_EN.pdf; the Python script runs on the host, and the dev board acts as the client via adb.
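The flow looks roughly like this (a sketch, assuming an fp32 ONNX export and an adb-connected RK3588 board; exact eval_perf arguments vary between toolkit versions):

```python
from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform="rk3588")
rknn.load_onnx(model="supercombo_fp32.onnx")
rknn.build(do_quantization=False)
rknn.export_rknn("supercombo.rknn")

# perf_debug=True enables the per-op timing table;
# the host talks to the board over adb.
rknn.init_runtime(target="rk3588", perf_debug=True)
rknn.eval_perf()
rknn.release()
```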
> I don't quite understand how adb works between a Linux host and a Linux client.
I'm using an Orange Pi 5, and you need to flash the stock Orange Pi Ubuntu (you can burn it to an SD card), otherwise adb doesn't work.
Latest supercombo model: Pow is the bottleneck.
After removing Pow from the ONNX graph and running the perf again: 34ms.
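One way to do that kind of graph surgery (a sketch, not necessarily what was done here; it assumes every Pow in the model is a plain square, i.e. exponent 2, so it can be rewritten as an NPU-friendly Mul — verify that before trusting the output):

```python
import onnx
from onnx import helper

model = onnx.load("supercombo_fp32.onnx")
graph = model.graph

new_nodes = []
for node in graph.node:
    if node.op_type == "Pow":
        base = node.input[0]
        # Assumes exponent == 2: x**2 -> x * x
        new_nodes.append(helper.make_node(
            "Mul", inputs=[base, base], outputs=list(node.output),
            name=node.name + "_as_mul"))
    else:
        new_nodes.append(node)

del graph.node[:]
graph.node.extend(new_nodes)
onnx.checker.check_model(model)
onnx.save(model, "supercombo_no_pow.onnx")
```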
34ms is really good; we need to find some solution to replace Pow.
> 34ms is really good; we need to find some solution to replace Pow.
Agree. For now I need to run an accuracy benchmark without replacing the Pow first. Do you happen to have any?
> Agree. For now I need to run an accuracy benchmark without replacing the Pow first. Do you happen to have any?
I'm a newbie on this. I asked in the RKNN group, and they said you can take a look at the rknn_matmul_api_demo; it can run on the GPU.
> I'm a newbie on this. I asked in the RKNN group, and they said you can take a look at the rknn_matmul_api_demo; it can run on the GPU.
Cool stuff, yeah I am already checking matmul. Need some time to digest and understand. I can't join the QQ group because my number is somehow barred. Could you also ask them if they plan to support more op types in the near future?
The officials never make promises; I will keep following this.
With 2.0.0b0 released, here are the Moonrise model perf logs:
---------------------------------------------------------------------------------------------------
Operator Time Consuming Ranking Table
---------------------------------------------------------------------------------------------------
OpType CallNumber CPUTime(us) GPUTime(us) NPUTime(us) TotalTime(us) TimeRatio(%)
---------------------------------------------------------------------------------------------------
ConvElu 79 0 0 14399 14399 64.09%
ConvRelu 42 0 0 2492 2492 11.09%
ConvAdd 25 0 0 1798 1798 8.00%
Conv 32 0 0 1717 1717 7.64%
ConvAddRelu 23 0 0 1278 1278 5.69%
Reshape 22 430 0 0 430 1.91%
Concat 4 0 0 172 172 0.77%
Div 1 85 0 0 85 0.38%
Sqrt 1 33 0 0 33 0.15%
Mul 1 0 0 21 21 0.09%
Clip 1 0 0 16 16 0.07%
Gather 1 12 0 0 12 0.05%
InputOperator 7 9 0 0 9 0.04%
OutputOperator 1 4 0 0 4 0.02%
---------------------------------------------------------------------------------------------------
With the 0.9.5 model:
---------------------------------------------------------------------------------------------------
Operator Time Consuming Ranking Table
---------------------------------------------------------------------------------------------------
OpType CallNumber CPUTime(us) GPUTime(us) NPUTime(us) TotalTime(us) TimeRatio(%)
---------------------------------------------------------------------------------------------------
Pow 19 88001 0 0 88001 64.13%
Conv 72 0 0 12220 12220 8.90%
Mul 79 0 0 11523 11523 8.40%
Add 40 0 0 5791 5791 4.22%
ConvRelu 44 0 0 5231 5231 3.81%
Tanh 19 0 0 3097 3097 2.26%
ConvAdd 14 0 0 2876 2876 2.10%
ConvAddRelu 24 0 0 2621 2621 1.91%
Concat 5 2098 0 280 2378 1.73%
Reshape 33 1529 0 191 1720 1.25%
Transpose 4 0 0 356 356 0.26%
MatMul 2 347 0 0 347 0.25%
exLayerNorm 2 0 0 282 282 0.21%
Split 2 0 0 208 208 0.15%
Elu 1 0 0 169 169 0.12%
exSoftmax13 1 0 0 160 160 0.12%
ConvSigmoid 1 0 0 75 75 0.05%
Div 1 70 0 0 70 0.05%
Clip 1 0 0 64 64 0.05%
Gather 1 18 0 0 18 0.01%
Sqrt 1 11 0 0 11 0.01%
InputOperator 8 8 0 0 8 0.01%
OutputOperator 1 4 0 0 4 0.00%
---------------------------------------------------------------------------------------------------
> With 2.0.0b0 released, here are the Moonrise model perf logs:
This one looks promising. Was there any difference before the 2.0.0beta update?
> Was there any difference before the 2.0.0beta update?
Yes, Reshape uses much less CPU time now. This was before the update:
---------------------------------------------------------------------------------------------------
Operator Time Consuming Ranking Table
---------------------------------------------------------------------------------------------------
OpType CallNumber CPUTime(us) GPUTime(us) NPUTime(us) TotalTime(us) TimeRatio(%)
---------------------------------------------------------------------------------------------------
Reshape 22 32473 0 0 32473 44.63%
ConvElu 79 0 0 20315 20315 27.92%
ConvRelu 42 0 0 5827 5827 8.01%
Conv 32 0 0 5113 5113 7.03%
ConvAdd 25 0 0 3280 3280 4.51%
ConvAddRelu 23 0 0 3187 3187 4.38%
Div 1 925 0 0 925 1.27%
Sqrt 1 620 0 0 620 0.85%
Concat 4 0 0 507 507 0.70%
Gather 1 316 0 0 316 0.43%
Clip 1 0 0 88 88 0.12%
Mul 1 0 0 88 88 0.12%
InputOperator 7 21 0 0 21 0.03%
OutputOperator 1 6 0 0 6 0.01%
---------------------------------------------------------------------------------------------------
> Yes, Reshape uses much less CPU time now.
Looking good. Have you checked the accuracy of the output? You can run the ONNX version, save the output as npy, then compare it with the output from the RKNN model. Take the cosine similarity of the outputs. If it's more than 0.99, it's quite accurate. (At least that is what I got for the 0.9.5 model.)
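Something along these lines for the ONNX side (a sketch, assuming onnxruntime is installed; the zero feeds are placeholders — use the same real frames you give the RKNN model):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("supercombo_fp32.onnx")
# Build a feed dict from the session's declared inputs; dynamic dims
# are treated as 1 here for illustration.
feeds = {i.name: np.zeros([d if isinstance(d, int) else 1 for d in i.shape],
                          dtype=np.float32)
         for i in sess.get_inputs()}
onnx_output = sess.run(None, feeds)[0]
np.save("onnx_output.npy", onnx_output)
```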
I used the code from ChatGPT, result is 0.00053688075. I don't know how to compute the cosine similarity of the outputs:
```python
import numpy as np

onnx_output = ...  # output saved from the ONNX model
rknn_output = ...  # output from the RKNN model

# Mean squared error between the two outputs
mse = np.mean((onnx_output - rknn_output) ** 2)
print("Mean Squared Error:", mse)
```
> I used the code from ChatGPT, result is 0.00053688075. I don't know how to compute the cosine similarity of the outputs.
```python
from scipy.spatial.distance import cosine

# Accuracy eval
onnx_output = ...
rknn_output = ...

# Cosine similarity = 1 - cosine distance
cosine_similarity = 1 - cosine(onnx_output[0][0], rknn_output[0][0])
print("Cosine similarity: " + str(cosine_similarity))
```
Looks like the MSE output is quite low, so I guess it's quite accurate. But just in case, do the cosine similarity too, because it takes the signed error into account.
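A toy illustration of why (made-up numbers): when the values are small, a sign-flipped output can still have a small MSE, but its cosine similarity collapses to -1.

```python
import numpy as np
from scipy.spatial.distance import cosine

ref = np.array([0.01, -0.01, 0.02])
noisy   = ref + 0.001   # small error, same direction
flipped = -ref          # same magnitudes, all signs flipped

for name, out in [("noisy", noisy), ("flipped", flipped)]:
    mse = np.mean((ref - out) ** 2)
    cos_sim = 1 - cosine(ref, out)
    print(name, mse, cos_sim)
# noisy:   MSE ~1e-6, cosine ~1.0
# flipped: MSE still small (~8e-4), cosine = -1.0
```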
By the way, are you by any chance using the new 2.0.0beta rknn-toolkit2 to convert the model? I am getting an error saying the input cannot be FP16...
> By the way, are you by any chance using the new 2.0.0beta rknn-toolkit2 to convert the model? I am getting an error saying the input cannot be FP16...
Convert to FP32 first; some code from openpilot:
```python
import onnx
import itertools
import numpy as np

def attributeproto_fp16_to_fp32(attr):
    # Reinterpret the raw fp16 bytes and rewrite them as fp32
    float32_list = np.frombuffer(attr.raw_data, dtype=np.float16)
    attr.data_type = 1  # TensorProto.FLOAT
    attr.raw_data = float32_list.astype(np.float32).tobytes()

def convert_fp16_to_fp32(path):
    model = onnx.load(path)
    # Convert fp16 initializers (data_type 10 == TensorProto.FLOAT16)
    for i in model.graph.initializer:
        if i.data_type == 10:
            attributeproto_fp16_to_fp32(i)
    # Retype the graph inputs and outputs
    for i in itertools.chain(model.graph.input, model.graph.output):
        if i.type.tensor_type.elem_type == 10:
            i.type.tensor_type.elem_type = 1
    # Convert fp16 tensors embedded in node attributes
    for i in model.graph.node:
        for a in i.attribute:
            if hasattr(a, 't'):
                if a.t.data_type == 10:
                    attributeproto_fp16_to_fp32(a.t)
    return model.SerializeToString()

model_data = convert_fp16_to_fp32("supercombo.onnx")
with open("/tmp/supercombo_fp32.onnx", "wb") as f:
    f.write(model_data)
```
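A quick sanity check after conversion (just one way you might verify it, not from the openpilot code):

```python
import onnx

m = onnx.load("/tmp/supercombo_fp32.onnx")
# 10 == TensorProto.FLOAT16; nothing fp16 should remain
assert all(i.data_type != 10 for i in m.graph.initializer)
assert all(v.type.tensor_type.elem_type != 10
           for v in list(m.graph.input) + list(m.graph.output))
print("all initializers and IO are fp32")
```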