Closed junyuancat1 closed 8 months ago
I have already updated ACL and Arm NN to v22.08, but this problem still exists. Is there any way to clean up or kill the Mali device and related threads, so that memory can be released when the hang happens, and then restart the Arm NN functions?
Hi,
My understanding from reading the original issue is that at some point between 5000 and 10000 inferences Arm NN hangs. Can you answer the following for me please:
Thanks, Colm.
Hey thanks! @Colm-in-Arm
Ok, that's a very interesting use case.
Looking at the differences in the rk3399 I can see that you'll have access to NEON, which isn't available on the rk3288. Can you change your test case to exclude CpuAcc and see if the problem persists? If you think all of the layers are available on GpuAcc, perhaps you could remove CpuRef too. Similarly, you could run only on CpuAcc and eliminate the GPU as a source of the problem.
I will attempt to recreate something similar on a hikey 960 and see if I get the same problem.
Colm.
@Colm-in-Arm Oh! This works. I seem to have found the problem. When I disable CpuAcc and CpuRef, I get an error like this: Warning: ERROR: Layer of type Concat is not supported on any preferred backend [GpuAcc ]
So the problem may be related to the Concat layer not being supported on GpuAcc. Could it hang when GpuAcc results poll back to CpuAcc at the Concat layer, after the rk3399 has run 5k+ times? Further, is there any documentation for me to develop GpuAcc Concat layer support, or is there any other solution to avoid this hang on the rk3288, rk3399, and rk3588, the three development boards I mainly use?
Plus, because of the shortage of CPU resources, I hope the vast majority of calculations can be completed on the GPU~
Here are the details of the Concat error. A strange Concat layer with just one input appears in my ONNX model after conversion from the .pt model, and this layer may cause 'armnn::InvalidArgumentException' what(): Number of inputs provided does not match network. I can edit the ONNX model to erase this layer, or add a single-input Concat layer whose output equals its input. This may let the program run on GpuAcc without CpuAcc and CpuRef, but I am not sure whether it will solve the hang~
I'm pleased you're making some progress. I agree that the concat layer looks very odd but you said this previously worked on v7 + GpuAcc. Could you try the combination of armnn::Compute::GpuAcc, armnn::Compute::CpuRef in that order please. In this case the concat layer might work on the CpuRef implementation. The entire network except that layer would run on the GPU. That way we could isolate the fault to some interaction with CpuAcc.
Colm.
@Colm-in-Arm I've tried combining GpuAcc and CpuRef. The hang still exists.
I've solved the Concat problem and am now trying to run all inferences on GpuAcc, but the hang still exists.
Hey, I am testing on several rk3399 boards. The boards installed on the robot show a high frequency of this problem (I can guarantee that only my thread uses GPU resources). I also tried to reproduce the problem on several clean rk3399 boards (installed in a computer room). The clean boards have run 200k+ times and the problem has not yet occurred. I suspect it may be a problem with the power supply of the GPU. Our RK GPU device uses the simple_ondemand governor, and the voltage may not be able to keep up with the frequency. So I changed the governor from ondemand to performance, which pins the highest frequency and no longer scales it down. But the stuck thread was straced to an event (POLLIN), so it seems like a data copy problem? I'm not sure, LOL. I'll keep following this and update this issue~ thanks~
Some deadlocks~
Is it possible that the voltage instability causes the GPU to fail to return an inference result, so the lock is never unlocked?
I found that when this hang happens, the GPU returns to normal soon, but the hang persists forever. I cannot unload the classifier network with the UnloadNetwork function while stuck. After I kill my process and rerun it, inference returns to normal again.
I'm afraid you've gone beyond my knowledge of the GPU architecture.
I did try running many thousands of inferences on a HiKey 960 device and I didn't hit any problems.
I might refer this to my colleagues on the Arm Compute Library team. They have the GPU experience.
Colm.
I don't think OpenCL has an abort mechanism. But since you are able to kill the process and restart it, the process is not in an uninterruptable sleep state.
If you really want to avoid killing the process and restarting, I'd maybe try installing a signal handler for SIGUSR1 in main, then starting a watchdog thread that (somehow) detects the hang state and sends SIGUSR1 to the process. That will cause any system calls to wake up, which might give them a chance to detect the error state, return an error, and let you unload/reload the network, etc.
I really don't know if it would be worth the trouble, though! It's quite possible that something in the OpenCL driver or Arm NN would be left in an inconsistent state and require restarting the process anyway :-(
Hope that helps, Matthew
Thanks for the reply! @MatthewARM My current workaround is to move the inference task into an independent process and let it communicate with my main process. (However, there is a lot of performance loss from data copying and transfer.) So I would prefer a watchdog.
I also tried adding a watchdog thread, but I found that when the hang occurs, Arm NN cannot release the network (the UnloadNetwork function returns false). When this happens, how can I release it correctly?
Here are the two processes:
I wrote a watchdog thread in the inference process with a function that deliberately causes a segmentation fault. When the watchdog detects that the inference thread has not updated its result for 30 s, it calls this segfault function and the whole process is killed. Alternatively, I could call system("sudo systemctl restart infer_process.service") to restart the inference process via its service. But is this the only way to solve this problem? I don't want an independent inference process (the inter-process communication has already become very complex logic, not to mention the image data copy and transfer overhead). I would rather have a single inference thread: when the watchdog detects the hang, I could release the Arm NN network and resources from the watchdog thread, and Arm NN would throw an exception when inference cannot get results (unlocking the lock when a network unload is detected while the condition variable is waiting). Is that doable? LOL
Hi @junyuancat1
Since we could not reproduce the problem on our devices and there has been no activity, I'm closing this issue. If you still need help with this, please create a new ticket.
https://github.com/ARM-software/armnn/issues/638