ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

armnn v22.08 Get stuck in rk3399 mali-T860 #708

Closed: junyuancat1 closed this issue 8 months ago

junyuancat1 commented 1 year ago

https://github.com/ARM-software/armnn/issues/638

junyuancat1 commented 1 year ago

I have already updated ACL and Arm NN to v22.08, and this problem still exists. Is there any way to clean up or kill the Mali device and related threads, so that memory can be released when the hang happens and the Arm NN functions can be restarted?

Colm-in-Arm commented 1 year ago

Hi,

My understanding from reading the original issue is that at some point between 5000 and 10000 inferences Arm NN hangs. Can you answer the following for me please:

Thanks, Colm.

junyuancat1 commented 1 year ago

Hey thanks! @Colm-in-Arm

  1. Yes, all inferences occur in a single thread of a single process.
  2. I load two networks into one runtime instance. I built the first network by hand, with a Multiplication layer, as an image-rescale net; the second is a ~1 MB ONNX net (exported from YOLOv5).
  3. Yes, all inferences use the same runtime instance. I load the two networks in sequence, and all inferences run in that same sequence (a simplified sketch is below).
  4. Nope.
  5. The ~1 MB ONNX file takes 320x320 RGB NCHW images as input. One more thing: it is weird that the hang occurs only when running on the RK3399 (Armv8); it never hangs on the RK3288 (Armv7). Here's my functions .cpp file, thanks again! lost-det.zip
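
For reference, here is a simplified sketch of how I load the two networks into the one runtime instance and run them in sequence (the model path, the rescale-net construction and the binding/buffer code are placeholders, not my real project code):

```cpp
// Simplified sketch of the setup described above: two networks loaded into a
// single armnn::IRuntime instance and executed in sequence on one thread.
#include <armnn/ArmNN.hpp>
#include <armnnOnnxParser/IOnnxParser.hpp>
#include <vector>

int main()
{
    armnn::IRuntime::CreationOptions options;
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

    // Network 1: built by hand, image rescale via a Multiplication layer.
    armnn::INetworkPtr rescaleNet = armnn::INetwork::Create();
    // ... AddInputLayer / AddMultiplicationLayer / AddOutputLayer, connect slots ...

    // Network 2: the ~1 MB ONNX detection model exported from YOLOv5.
    armnnOnnxParser::IOnnxParserPtr parser = armnnOnnxParser::IOnnxParser::Create();
    armnn::INetworkPtr detNet = parser->CreateNetworkFromBinaryFile("lost-det.onnx");

    std::vector<armnn::BackendId> backends = { armnn::Compute::GpuAcc,
                                               armnn::Compute::CpuAcc,
                                               armnn::Compute::CpuRef };

    armnn::NetworkId rescaleId = 0;
    armnn::NetworkId detId     = 0;
    runtime->LoadNetwork(rescaleId,
                         armnn::Optimize(*rescaleNet, backends, runtime->GetDeviceSpec()));
    runtime->LoadNetwork(detId,
                         armnn::Optimize(*detNet, backends, runtime->GetDeviceSpec()));

    // Per frame: rescale first, then detect, always on the same runtime and thread.
    // runtime->EnqueueWorkload(rescaleId, rescaleInputs, rescaleOutputs);
    // runtime->EnqueueWorkload(detId, detInputs, detOutputs);

    runtime->UnloadNetwork(detId);
    runtime->UnloadNetwork(rescaleId);
    return 0;
}
```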
Colm-in-Arm commented 1 year ago

Ok, that's a very interesting use case.

Looking at the differences on the RK3399, I can see that you'll have access to NEON, which isn't available on the RK3288. Can you change your test case to exclude CpuAcc and see if the problem persists? If you think all of the layers are available on GpuAcc, perhaps you could remove CpuRef too. Similarly, you could run only on CpuAcc and eliminate the GPU as a source of the problem.
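
Concretely, the only change needed should be the backend list passed to armnn::Optimize, something like this (building on the sketch you posted above; variable names are just illustrative):

```cpp
// GPU only: excludes CpuAcc and CpuRef entirely.
std::vector<armnn::BackendId> gpuOnly = { armnn::Compute::GpuAcc };

// CPU (NEON) only: eliminates the GPU as a source of the problem.
std::vector<armnn::BackendId> cpuOnly = { armnn::Compute::CpuAcc };

// Pass whichever list you are testing:
// armnn::IOptimizedNetworkPtr optNet =
//     armnn::Optimize(*network, gpuOnly, runtime->GetDeviceSpec());
```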

I will attempt to recreate something similar on a HiKey 960 and see if I get the same problem.

Colm.

junyuancat1 commented 1 year ago

@Colm-in-Arm Oh, this works! I think I've found the problem. When I disable CpuAcc and CpuRef, an error like this occurs: Warning: ERROR: Layer of type Concat is not supported on any preferred backend [GpuAcc ]

So the problem may be related to the Concat layer not being supported on GpuAcc. Could it hang when the GpuAcc results are polled (POLLIN) back to CpuAcc at the Concat, once the RK3399 has run 5k+ times? Also, is there any documentation that would help me develop GpuAcc Concat layer support, or is there any other solution to avoid this hang on the three development boards I mainly use: RK3288, RK3399 and RK3588?

junyuancat1 commented 1 year ago

Plus, because CPU resources are short, I'd like the vast majority of the computation to be done on the GPU~

junyuancat1 commented 1 year ago

(screenshot attached) Here are the details of the Concat error. A strange Concat layer with just one input appears in my ONNX model after conversion from the .pt model, and this layer causes 'armnn::InvalidArgumentException' what(): Number of inputs provided does not match network. I can edit the ONNX model to remove this layer, or add support for a single-input Concat layer whose output equals its input. That may let the program run on GpuAcc without CpuAcc and CpuRef, but I am not sure whether it will solve the hang problem~

Colm-in-Arm commented 1 year ago

I'm pleased you're making some progress. I agree that the Concat layer looks very odd, but you said this previously worked on v7 + GpuAcc. Could you try the combination of armnn::Compute::GpuAcc, armnn::Compute::CpuRef, in that order, please? In this case the Concat layer might run on the CpuRef implementation; the entire network except that layer would run on the GPU. That way we could isolate the fault to some interaction with CpuAcc.
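
That is, something like this (illustrative):

```cpp
// GpuAcc first, with CpuRef as the fallback for layers GpuAcc can't run (e.g. Concat).
std::vector<armnn::BackendId> backends = { armnn::Compute::GpuAcc,
                                           armnn::Compute::CpuRef };
```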

Colm.

junyuancat1 commented 1 year ago

@Colm-in-Arm I've tried the combination of GpuAcc and CpuRef. The hang still occurs.

junyuancat1 commented 1 year ago

I've solved the Concat problem and am now trying to run all inferences on GpuAcc, but the hang still occurs.

junyuancat1 commented 1 year ago

Hey, I am testing on several RK3399 boards. The boards installed on the robot show this problem frequently (I can guarantee that only my thread uses GPU resources). I also tried to reproduce the problem on several clean RK3399 boards (installed in a computer room). The clean boards have run 200k+ times and the problem has not yet occurred. I suspect it may be a problem with the GPU power supply. Our RK GPU device uses the simple_ondemand governor, and the voltage may not be able to keep up with the frequency, so I changed the governor from ondemand to performance, which pins the highest frequency and no longer scales it down. But the thing is, strace shows the stuck thread waiting on a poll event (POLLIN), so it seems like a data-copy problem? I'm not sure, LOL. I'll keep following this and update this issue~ Thanks~

junyuancat1 commented 1 year ago

(two screenshots of stack traces attached)

some deadlocks~

junyuancat1 commented 1 year ago

Is it possible that voltage instability causes the GPU to fail to return an inference result, so the lock can never be unlocked? I found that when this hang happens, the GPU returns to normal soon, but the hang lasts forever. I cannot unload the classifier network with the UnloadNetwork function while it is stuck. After I kill my process and rerun it, inference returns to normal.

Colm-in-Arm commented 1 year ago

I'm afraid you've gone beyond my knowledge of the GPU architecture.

I did try running many thousands of inferences on a HiKey 960 device and I didn't hit any problems.

I might refer this to my colleagues in Arm Computer Library. They have the GPU experience.

Colm.

MatthewARM commented 1 year ago

I don't think OpenCL has an abort mechanism. But since you are able to kill the process and restart it, the process is not in an uninterruptible sleep state.

If you really want to avoid killing the process and restarting, I'd maybe try installing a signal handler for SIGUSR1 in main, then starting a watchdog thread that (somehow) detects the hang state and sends SIGUSR1 to the process. That will cause any system calls to wake up, which might give them a chance to detect the error state, return an error, and let you unload/reload the network, etc.
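
Roughly like this (a sketch only; the hang detection itself is left out, and here the signal is aimed directly at the stuck thread with pthread_kill rather than at the whole process):

```cpp
// Sketch of the SIGUSR1 idea: a no-op handler installed without SA_RESTART,
// so a blocking system call in the stuck inference thread returns with EINTR
// when the watchdog signals it. How the watchdog decides the process is hung
// (e.g. a stale "last result" timestamp) is not shown here.
#include <csignal>
#include <pthread.h>

void OnUsr1(int) { /* no-op: its only job is to interrupt blocking syscalls */ }

void InstallUsr1Handler()
{
    struct sigaction sa {};
    sa.sa_handler = OnUsr1;              // deliberately no SA_RESTART flag
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, nullptr);
}

// Called from the watchdog thread once it decides the inference thread is hung.
void WakeStuckThread(pthread_t inferenceThread)
{
    pthread_kill(inferenceThread, SIGUSR1);   // interrupted syscalls return EINTR
}
```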

I really don't know if it would be worth the trouble, though! It's quite possible that something in the OpenCL driver or Arm NN would be left in an inconsistent state and require restarting the process anyway :-(

Hope that helps, Matthew

junyuancat1 commented 1 year ago

Thanks for the reply! @MatthewARM My current approach is to turn the inference task into an independent process and let it communicate with my main process (however, the data copying and transfer cause a lot of performance loss), so I would prefer a watchdog.

I also tried adding a watchdog thread, but I found that when the blockage occurs, Arm NN cannot release the network (the UnloadNetwork function returns false). When this happens, how can I release it correctly?

junyuancat1 commented 1 year ago

Here is my current two-process setup:

I wrote a watchdog thread in the inference process, with a function that deliberately causes a segmentation fault. When the watchdog detects that the inference thread has not updated its result for 30 s, it calls this segfault function and the whole process is killed. Alternatively, I can call system("sudo systemctl restart infer_process.service") to restart the inference process as a service. But is this the only way to solve the problem? I don't want an independent inference process (the inter-process communication has already become very complex logic, not to mention the cost of copying and sending image data). What I would like is a single inference thread: when the watchdog detects that inference is stuck, it releases the Arm NN network and resources, and Arm NN throws an exception when the inference cannot get a result (unlocking the lock when a network unload is detected while the condition variable is waiting). Is that doable? LOL
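
For context, my current watchdog looks roughly like this (a simplified sketch; infer_process.service and the 30 s timeout are my own choices, and std::abort() stands in for the deliberate-crash function):

```cpp
// Watchdog thread inside the inference process: if no inference result has
// been published for 30 s, kill or restart the whole process.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <thread>

std::atomic<std::int64_t> g_lastResultMs{0};

static std::int64_t NowMs()
{
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

void WatchdogLoop()
{
    for (;;)
    {
        std::this_thread::sleep_for(std::chrono::seconds(5));
        if (NowMs() - g_lastResultMs.load() > 30000)
        {
            // Option 1: crash on purpose and let the service manager respawn us.
            // std::abort();
            // Option 2: ask systemd to restart the inference service.
            std::system("sudo systemctl restart infer_process.service");
        }
    }
}

// The inference loop stores g_lastResultMs = NowMs(); after every result,
// and main() runs WatchdogLoop() on a detached std::thread.
```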

morgolock commented 8 months ago

Hi @junyuancat1

Since we could not reproduce the problem on our devices and there has been no activity I'm closing this issue. If you still need help with this then please create a new ticket.