airockchip / rknn-toolkit2

rknn_matmul_set_core_mask failed on RK3588 #7

Open zhtroy opened 7 months ago

zhtroy commented 7 months ago

I'm running rknn_matmul_api_demo. I tried to run the demo on NPU core0 and core1 on RK3588, but it failed.

I modified the demo as below:

  int ret = rknn_matmul_create(&ctx, &info, &io_attr);
  if (ret < 0)
  {
    fprintf(stderr, "rknn_matmul_create fail! ret=%d\n", ret);
    return -1;
  }
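  // request multi-core mode: run this context on NPU core 0 + core 1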
  ret = rknn_matmul_set_core_mask(ctx, RKNN_NPU_CORE_0_1);
  if (ret < 0)
  {
    fprintf(stderr, "rknn_matmul_coremask fail! ret=%d\n", ret);
    return -1;
  }

Result:

E RKNN: [18:09:24.926] Not support core mask: 3, fallback to single core auto mode
E RKNN: [18:09:24.926] NN Compiler/Model Version is 0.0.0 now
E RKNN: [18:09:24.926] rknn_set_core_mask: failed to set core mask: 3
rknn_matmul_coremask fail! ret=-1
0312birdzhang commented 7 months ago

What's the rknnrt version?

zhtroy commented 6 months ago

I cloned rknn-toolkit2 and compiled the example from rknn-toolkit2/rknpu2/examples/rknn_matmul_api_demo. So I think the rknnrt version is 1.6, according to the documentation.

marty1885 commented 6 months ago

I think that function is simply broken/not working as advertised, judging from the message "Not support core mask: 3".

> I tried to run the demo in NPU core0 and core1 on RK3588

AFAIK, RKNN 1.6.0 does not support multi-core co-working matrix multiplication, but the runtime will automatically distribute matrix multiplications (and other model inference) onto idle NPU cores. You can run 3 matrix multiplications on 3 threads and they will be distributed across the cores correctly.
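Something like this (a minimal, untested sketch; run_one_matmul() is a hypothetical helper wrapping the demo's rknn_matmul_create() / rknn_matmul_run() / rknn_matmul_destroy() sequence, and each thread builds and owns its own rknn_matmul_ctx):

  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical helper: wraps the demo's create/run/destroy sequence.
   * Each call creates and owns a separate rknn_matmul_ctx. */
  extern int run_one_matmul(int M, int K, int N);

  static void *worker(void *arg)
  {
    int idx = (int)(intptr_t)arg;
    /* No core mask is set; the runtime places each context on an idle NPU core. */
    int ret = run_one_matmul(512, 512, 512);
    if (ret < 0)
      fprintf(stderr, "matmul %d failed: ret=%d\n", idx, ret);
    return NULL;
  }

  int main(void)
  {
    pthread_t t[3];
    for (int i = 0; i < 3; i++)
      pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < 3; i++)
      pthread_join(t[i], NULL);
    return 0;
  }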

zhtroy commented 6 months ago

So, no parallel matrix multiplication then. For A*B=C, I'll break the A matrix into 3 smaller matrices A1, A2, A3 (row-wise) and merge the results C1, C2, C3.

marty1885 commented 6 months ago

@zhtroy Yes. I've benchmarked the performance for different matrix dimensions: you want to break the matrix across the columns to get better performance.

See my post about speed under different shapes, and the discussions in llama.cpp.

Also, it's messy since there are only 3 NPU cores. Either you break the matrix into just 2 pieces, or you use all 3 NPU cores plus some CPU to split it into 4 pieces. Pick your poison.
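For what it's worth, a rough sketch of the column-wise split/merge itself (a naive CPU loop stands in for the NPU call, just to show the partitioning; in practice each piece would be an rknn_matmul_run() on its own context and thread, as above):

  /* Reference matmul: C[M x N] = A[M x K] * B[K x N].
   * ldb/ldc are the row strides of B and C inside the full matrices,
   * so the same routine can write straight into a column block. */
  static void matmul_block(const float *A, const float *B, float *C,
                           int M, int K, int N, int ldb, int ldc)
  {
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
      {
        float acc = 0.f;
        for (int k = 0; k < K; k++)
          acc += A[i * K + k] * B[k * ldb + j];
        C[i * ldc + j] = acc;
      }
  }

  /* Column-wise split: B = [B1 | B2 | ...], C = [A*B1 | A*B2 | ...].
   * Every piece reuses the whole A matrix. */
  void matmul_col_split(const float *A, const float *B, float *C,
                        int M, int K, int N, int pieces)
  {
    int col = 0;
    for (int p = 0; p < pieces; p++)
    {
      int cols = N / pieces + (p < N % pieces ? 1 : 0);
      /* On RK3588 this call would be dispatched to one NPU core
       * (or the CPU, for a 4th piece) from its own thread. */
      matmul_block(A, B + col, C + col, M, K, cols, N, N);
      col += cols;
    }
  }

Note that with the column split, each result block already lands in a disjoint column range of C, so the "merge" is just writing at the right offsets; no copy is needed.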