AndrewJNg / NPU-on-rk3588


What is the accuracy and execution speed of the model? #1

Open 0312birdzhang opened 5 months ago

0312birdzhang commented 5 months ago

I have also tested it, but the NPU usage is very low, and it takes about 50ms to infer one frame.

iXcess commented 5 months ago

Which model did you infer with? Are you using all the NPU cores?

iXcess commented 5 months ago

supercombo @master 341b81c0a5c5f378f749a8ba07514f48f0f9faa5. The latest model takes me 100ms to run. I haven't had time to check the older models. Accuracy is bad; not sure if it's my implementation.

0312birdzhang commented 5 months ago

Which model did you infer with? Are you using all the NPU cores?

https://github.com/commaai/openpilot/commit/531c13f2acd8d15914aba34a40b356dcb7f02ced

Yes, with RKNNLite.NPU_CORE_0_1_2.

supercombo @master 341b81c0a5c5f378f749a8ba07514f48f0f9faa5. The latest model takes me 100ms to run. I haven't had time to check the older models. Accuracy is bad; not sure if it's my implementation.

The latest model takes 100ms for me too. I'm thinking of using the C++ rknn_api, but I'm a Java/Python developer, so it's hard for me.
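
For reference, roughly how the all-core setup looks with rknn-toolkit-lite2 (a sketch; the model path and the input tensor are placeholders, not the real supercombo inputs):

import numpy as np
from rknnlite.api import RKNNLite

rknn_lite = RKNNLite()
rknn_lite.load_rknn('supercombo.rknn')                      # placeholder path to the converted model
rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)   # spread work across NPU cores 0, 1 and 2
dummy_input = np.zeros((1, 12, 128, 256), dtype=np.float32) # stand-in tensor, not the real model inputs
outputs = rknn_lite.inference(inputs=[dummy_input])
rknn_lite.release()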

iXcess commented 5 months ago

I am already running with the C++ rknn_api; it only saves about 10ms on the latest model compared to Python, so it's not worth it. I think we need to look at the underlying issue and get the performance of each op type.

I just did a quick test with the older models: v0.8.13 runs at 40ms single core, the lemon pie model runs at 70ms single core, and the latest model, certified herbalist, runs at 110ms single core.

iXcess commented 5 months ago

I have also tested it, but the NPU usage is very low, and it takes about 50ms to infer one frame.

Also, what dev platform are you running on? Which OS? Are you able to do a perf_debug?

0312birdzhang commented 5 months ago

Also, what dev platform are you running on? Which OS? Are you able to do a perf_debug?

Same as this project. https://github.com/Joshua-Riek/ubuntu-rockchip

The perf log: https://gist.github.com/0312birdzhang/810b1fd40427484f7167e9bd4bf4f1cf

iXcess commented 5 months ago

Oh, thank you. Looks like Reshape is taking the bulk of the time moving things around between the NPU and the CPU. Maybe we can remove Reshape and find a substitute for it.

By the way, how are you getting the perf log? I don't quite understand how adb works between a Linux host and a Linux client.

0312birdzhang commented 5 months ago

By the way, how are you getting the perf log?

Search for rknn.eval_perf() in 01_Rockchip_RKNPU_Quick_Start_RKNN_SDK_V1.6.0_EN.pdf; the Python script runs on the host and the dev board acts as the client via adb.
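
Roughly, the flow from the quick-start guide looks like this (a sketch with placeholder paths; perf_debug=True is what enables the per-op timing):

from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform='rk3588')
rknn.load_onnx(model='supercombo_fp32.onnx')         # placeholder: fp32 ONNX model
rknn.build(do_quantization=False)
rknn.init_runtime(target='rk3588', perf_debug=True)  # board is reached over adb
rknn.eval_perf()                                     # prints the op-level timing table
rknn.release()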

I don't quite understand how adb works between a Linux host and a Linux client.

I'm using an Orange Pi 5, and you need to flash the stock Orange Pi Ubuntu (you can burn it to an SD card); otherwise adb doesn't work.

iXcess commented 5 months ago

Screenshot from 2024-03-22 15-25-18 (op-level perf table)

Latest supercombo model: Pow is the bottleneck.

iXcess commented 5 months ago

conversion.log

After removing Pow from the ONNX graph and running the perf again: 34ms.
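
This isn't necessarily how it was done above, but a sketch of one common way to drop Pow (hypothetical helper; it only handles Pow nodes whose exponent is an initializer with value 2 or 3, rewriting them as Mul chains that stay on the NPU):

import onnx
from onnx import helper, numpy_helper

def replace_pow_with_mul(path_in, path_out):
  model = onnx.load(path_in)
  graph = model.graph
  consts = {i.name: numpy_helper.to_array(i) for i in graph.initializer}
  new_nodes = []
  for node in graph.node:
    e = None
    if node.op_type == "Pow":
      exp = consts.get(node.input[1])
      if exp is not None and exp.size == 1:
        e = float(exp.reshape(-1)[0])
    if e == 2.0:
      x, out = node.input[0], node.output[0]
      new_nodes.append(helper.make_node("Mul", [x, x], [out]))  # x**2 -> x*x
    elif e == 3.0:
      x, out = node.input[0], node.output[0]
      sq = out + "_sq"
      new_nodes.append(helper.make_node("Mul", [x, x], [sq]))   # x**3 -> (x*x)*x
      new_nodes.append(helper.make_node("Mul", [sq, x], [out]))
    else:
      new_nodes.append(node)
  del graph.node[:]
  graph.node.extend(new_nodes)
  onnx.save(model, path_out)

replace_pow_with_mul("supercombo_fp32.onnx", "supercombo_nopow.onnx")  # placeholder paths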

0312birdzhang commented 5 months ago

34ms is really good. We need to find some solution to replace Pow.

iXcess commented 5 months ago

34ms is really good. We need to find some solution to replace Pow.

Agreed. For now I need to run an accuracy benchmark without replacing Pow first. Do you happen to have any?

0312birdzhang commented 5 months ago

Agreed. For now I need to run an accuracy benchmark without replacing Pow first. Do you happen to have any?

I'm a newbie at this. I asked in the RKNN group, and they said to take a look at the rknn_matmul_api_demo; it can run on the GPU.

iXcess commented 5 months ago

Agreed. For now I need to run an accuracy benchmark without replacing Pow first. Do you happen to have any?

I'm a newbie at this. I asked in the RKNN group, and they said to take a look at the rknn_matmul_api_demo; it can run on the GPU.

Cool stuff, yeah, I am already checking matmul. I need some time to digest and understand it. I can't join the QQ group because my number is somehow barred. Could you also ask them whether they plan to support more op types in the near future?

0312birdzhang commented 5 months ago

The officials never make promises; I will keep following this.

0312birdzhang commented 5 months ago

With 2.0.0b0 released, here are the Moonrise model perf logs:

---------------------------------------------------------------------------------------------------
                                 Operator Time Consuming Ranking Table            
---------------------------------------------------------------------------------------------------
OpType             CallNumber   CPUTime(us)  GPUTime(us)  NPUTime(us)  TotalTime(us)  TimeRatio(%)  
---------------------------------------------------------------------------------------------------
ConvElu            79           0            0            14399        14399          64.09%        
ConvRelu           42           0            0            2492         2492           11.09%        
ConvAdd            25           0            0            1798         1798           8.00%         
Conv               32           0            0            1717         1717           7.64%         
ConvAddRelu        23           0            0            1278         1278           5.69%         
Reshape            22           430          0            0            430            1.91%         
Concat             4            0            0            172          172            0.77%         
Div                1            85           0            0            85             0.38%         
Sqrt               1            33           0            0            33             0.15%         
Mul                1            0            0            21           21             0.09%         
Clip               1            0            0            16           16             0.07%         
Gather             1            12           0            0            12             0.05%         
InputOperator      7            9            0            0            9              0.04%         
OutputOperator     1            4            0            0            4              0.02%         
---------------------------------------------------------------------------------------------------

0312birdzhang commented 5 months ago

With 0.9.5 model:

---------------------------------------------------------------------------------------------------
                                 Operator Time Consuming Ranking Table            
---------------------------------------------------------------------------------------------------
OpType             CallNumber   CPUTime(us)  GPUTime(us)  NPUTime(us)  TotalTime(us)  TimeRatio(%)  
---------------------------------------------------------------------------------------------------
Pow                19           88001        0            0            88001          64.13%        
Conv               72           0            0            12220        12220          8.90%         
Mul                79           0            0            11523        11523          8.40%         
Add                40           0            0            5791         5791           4.22%         
ConvRelu           44           0            0            5231         5231           3.81%         
Tanh               19           0            0            3097         3097           2.26%         
ConvAdd            14           0            0            2876         2876           2.10%         
ConvAddRelu        24           0            0            2621         2621           1.91%         
Concat             5            2098         0            280          2378           1.73%         
Reshape            33           1529         0            191          1720           1.25%         
Transpose          4            0            0            356          356            0.26%         
MatMul             2            347          0            0            347            0.25%         
exLayerNorm        2            0            0            282          282            0.21%         
Split              2            0            0            208          208            0.15%         
Elu                1            0            0            169          169            0.12%         
exSoftmax13        1            0            0            160          160            0.12%         
ConvSigmoid        1            0            0            75           75             0.05%         
Div                1            70           0            0            70             0.05%         
Clip               1            0            0            64           64             0.05%         
Gather             1            18           0            0            18             0.01%         
Sqrt               1            11           0            0            11             0.01%         
InputOperator      8            8            0            0            8              0.01%         
OutputOperator     1            4            0            0            4              0.00%         
---------------------------------------------------------------------------------------------------

iXcess commented 5 months ago

With 2.0.0b0 released, here are the Moonrise model perf logs:

This one looks promising. Was there any difference before the 2.0.0beta update?

0312birdzhang commented 5 months ago

Was there any difference before the 2.0.0beta update?

Yes, Reshape uses much less CPU time now. Before the update:

---------------------------------------------------------------------------------------------------
                                 Operator Time Consuming Ranking Table            
---------------------------------------------------------------------------------------------------
OpType             CallNumber   CPUTime(us)  GPUTime(us)  NPUTime(us)  TotalTime(us)  TimeRatio(%)  
---------------------------------------------------------------------------------------------------
Reshape            22           32473        0            0            32473          44.63%        
ConvElu            79           0            0            20315        20315          27.92%        
ConvRelu           42           0            0            5827         5827           8.01%         
Conv               32           0            0            5113         5113           7.03%         
ConvAdd            25           0            0            3280         3280           4.51%         
ConvAddRelu        23           0            0            3187         3187           4.38%         
Div                1            925          0            0            925            1.27%         
Sqrt               1            620          0            0            620            0.85%         
Concat             4            0            0            507          507            0.70%         
Gather             1            316          0            0            316            0.43%         
Clip               1            0            0            88           88             0.12%         
Mul                1            0            0            88           88             0.12%         
InputOperator      7            21           0            0            21             0.03%         
OutputOperator     1            6            0            0            6              0.01%         
---------------------------------------------------------------------------------------------------

iXcess commented 5 months ago

Was there any difference before the 2.0.0beta update?

Yes, Reshape uses much less CPU time now. Before the update:

Looking good. Have you checked the accuracy of the output? You can run the ONNX version, save the output as npy, then compare it with the output from the RKNN model. Take the cosine similarity of the outputs; if it's more than 0.99, it's quite accurate. (At least that is what I got for the 0.9.5 model.)
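
Something like this for the ONNX side (a sketch; paths are placeholders and the zero tensors are stand-ins for real model inputs):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("supercombo_fp32.onnx")  # placeholder path
feeds = {}
for inp in sess.get_inputs():
  # replace symbolic dims with 1 and feed zeros; use real frames in practice
  shape = [d if isinstance(d, int) else 1 for d in inp.shape]
  feeds[inp.name] = np.zeros(shape, dtype=np.float32)
onnx_output = sess.run(None, feeds)[0]
np.save("onnx_output.npy", onnx_output)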

0312birdzhang commented 5 months ago

I used the code from ChatGPT; the result is 0.00053688075. I don't know how to take the cosine similarity of the outputs.

import numpy as np

onnx_output = ...  # output from the ONNX model
rknn_output = ...  # output from the RKNN model
mse = np.mean((onnx_output - rknn_output) ** 2)
print("Mean Squared Error:", mse)

iXcess commented 5 months ago

I used the code from ChatGPT; the result is 0.00053688075. I don't know how to take the cosine similarity of the outputs.

You can do it like this:

from scipy.spatial.distance import cosine

# Accuracy eval
onnx_output = ...  # output from the ONNX model
rknn_output = ...  # output from the RKNN model

# scipy's cosine() returns the cosine distance, so similarity = 1 - distance
cosine_similarity = 1 - cosine(onnx_output[0][0], rknn_output[0][0])
print("Cosine similarity: " + str(cosine_similarity))

Looks like the MSE is quite low, so I guess it's quite accurate. But just in case, do the cosine similarity too, because it takes the sign of the error into account.

iXcess commented 5 months ago

By the way, are you by any chance using the new 2.0.0beta rknn-toolkit2 to convert the model? I am getting an error saying the input cannot be FP16...

0312birdzhang commented 5 months ago

By the way, are you by any chance using the new 2.0.0beta rknn-toolkit2 to convert the model? I am getting an error saying the input cannot be FP16...

Convert to fp32 first; here is some code from openpilot:

import onnx
import itertools
import numpy as np

def attributeproto_fp16_to_fp32(attr):
  # Reinterpret the raw fp16 bytes as fp32 and mark the tensor as FLOAT (data_type 1)
  float32_list = np.frombuffer(attr.raw_data, dtype=np.float16)
  attr.data_type = 1
  attr.raw_data = float32_list.astype(np.float32).tobytes()

def convert_fp16_to_fp32(path):
  model = onnx.load(path)
  # Convert initializers stored as FLOAT16 (data_type 10)
  for i in model.graph.initializer:
    if i.data_type == 10:
      attributeproto_fp16_to_fp32(i)
  # Convert the declared graph input/output tensor types
  for i in itertools.chain(model.graph.input, model.graph.output):
    if i.type.tensor_type.elem_type == 10:
      i.type.tensor_type.elem_type = 1
  # Convert fp16 tensors embedded in node attributes (e.g. Constant nodes)
  for i in model.graph.node:
    for a in i.attribute:
      if hasattr(a, 't'):
        if a.t.data_type == 10:
          attributeproto_fp16_to_fp32(a.t)
  return model.SerializeToString()

model_data = convert_fp16_to_fp32("supercombo.onnx")
with open("/tmp/supercombo_fp32.onnx", "wb") as f:
    f.write(model_data)
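
From there, the fp32 ONNX should go through the usual rknn-toolkit2 conversion (a sketch; the options and output path are assumptions, not a verified config):

from rknn.api import RKNN

rknn = RKNN()
rknn.config(target_platform='rk3588')
rknn.load_onnx(model='/tmp/supercombo_fp32.onnx')
rknn.build(do_quantization=False)    # fp16 inference on the NPU, no int8 quantization
rknn.export_rknn('supercombo.rknn')  # placeholder output path
rknn.release()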