ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Does ExecuteNetwork support the "GpuAcc" runtime? #767

Closed: somasundaram1702 closed this issue 2 months ago

somasundaram1702 commented 5 months ago

Hello Team,

I am trying to run a neural network on my custom-developed SoC in an emulation platform (ZEBU). The SoC has a Mali-G710 GPU integrated with a Cortex-A53 CPU. I can successfully use ExecuteNetwork to run an '.armnn' or '.onnx' MobileNet model on the CPU. However, when I try to run it on the GPU, I get the error below (screenshot: GPU_Err).

Note: I get this error both with the pre-built binaries and with ExecuteNetwork built from source myself.

From the GlobalConfig.cmake file, I noticed that ExecuteNetwork is generated only when the conditions below are satisfied. Among them, ARMCOMPUTECL must be set to 0, yet that option is what enables accelerated (GPU) compute. I believe it should be enabled to run the neural network on the GPU.

option(EXECUTE_NETWORK_STATIC " This is a limited experimental build that is entirely static. It currently only supports being set by changing the current CMake default options like so: BUILD_TF_LITE_PARSER=1/0 BUILD_ARMNN_SERIALIZER=1/0 ARMCOMPUTENEON=1/0 ARMNNREF=1/0 ARMCOMPUTECL=0 BUILD_ONNX_PARSER=0 BUILD_CLASSIC_DELEGATE=0 BUILD_OPAQUE_DELEGATE=0 BUILD_TIMELINE_DECODER=0 BUILD_BASE_PIPE_SERVER=0 BUILD_UNIT_TESTS=0 ARMNN_SAMPLE_APPS_ENABLED=0 BUILD_SHARED_LIBS=0 BUILD_GATORD_MOCK=0 HEAP_PROFILING=0 LEAK_CHECKING=0" OFF)

So, does ExecuteNetwork support running the neural network on the GPU? Even though we pass 'GpuAcc' as the runtime, was it built appropriately for that?

Colm-in-Arm commented 5 months ago

Hello.

Yes, ExecuteNetwork does support acceleration on Mali GPUs. In the example you show, it appears that part of the model is not supported. I suggest specifying both GPU and CPU: "-c GpuAcc,CpuAcc". Any layer that is not supported on the GPU will then fall back to the CPU.

Colm.

somasundaram1702 commented 5 months ago

@Colm-in-Arm: Thanks for your response. As I mentioned, I want to run the neural network on my SoC. Right now I cannot install Python, NumPy or any other packages, because my SoC has no web access.

Kindly suggest an alternative way to run my neural network on the Mali-G710 GPU.

Also, can ExecuteNetwork accelerate the neural network on any other Arm-based GPU?

FrancisMurtagh-arm commented 5 months ago

@somasundaram1702

From the error message, there appears to be an issue with OpenCL on your device: "failed creating base context during opening of kernel driver"

This might cause the GpuAcc backend to fail to register.

Can you show us the output from calling clinfo?

Your screenshot also cuts off the end of the message "Current platform provides: ..."; what does it say? If GpuAcc isn't in that list, that confirms it didn't register.

Thanks, Francis.

somasundaram1702 commented 5 months ago

@FrancisMurtagh-arm: Yes, you are right; "GpuAcc" isn't listed there.

It says "current platform provides [CpuAcc, CpuRef]". Please note that I am using the OpenCL libraries generated by the DDK specific to the Mali-G710 GPU.

Also, the test application "mali_base_csf_cqs_test" that comes with the DDK passes, which indicates that certain operations are functional on the GPU.

When I run clinfo, I get "-sh: clinfo: not found". I believe the CL utilities are not installed.

@FrancisMurtagh-arm: Is this why ExecuteNetwork is not able to identify the GPU? Kindly share some information. Thanks

FrancisMurtagh-arm commented 5 months ago

Hi @somasundaram1702,

Are you using a distro like Ubuntu or Debian? Can you install it via sudo apt install clinfo?

Our GPU backend dynamically loads one of "libOpenCL.so", "libGLES_mali.so" or "libmali.so" from your LD_LIBRARY_PATH. If it can't find at least one of them, or there is an issue with the one it finds, the backend may fail to register.

Francis.

somasundaram1702 commented 5 months ago

@FrancisMurtagh-arm: No, I am not using any distro. I am building Linux 5.10.198 from scratch. Also, as I mentioned earlier, my SoC is not connected to the web, so "sudo apt install --" does not work. (I am trying my best to include clinfo while building the rootfs.)

Note: ExecuteNetwork works fine when I use "CpuAcc" or "CpuRef".

Regarding dynamic loading: yes, as you mentioned, I set LD_LIBRARY_PATH to point to the "libOpenCL.so", "libGLES_mali.so" and "libmali.so" files. Initially I got the following error:

Info: ArmNN v33.0.0 Couldn't find any of the following OpenCL library: libOpenCL.so libGLES_mali.so libmali.so

But after pointing LD_LIBRARY_PATH at the proper ".so" files generated by the DDK, the driver is able to pick up the files and the above error disappeared, which shows that the driver is able to load them.

Note that I use "insmod" to insert my drivers from the Linux terminal. Below are the drivers I am inserting on my SoC, along with the firmware:

1) dma-buf-test-exporter.ko
2) memory_group_manager.ko
3) protected_memory_allocator.ko

Kindly guide me on the way forward.

Thanks !

FrancisMurtagh-arm commented 5 months ago

Hi,

clinfo is the easiest way I can think of to check whether OpenCL is configured correctly.

Can you copy over a .deb archive of clinfo and install it with dpkg -i?

Another user had a similar issue: https://github.com/ARM-software/armnn/issues/173

Regards, Francis.

somasundaram1702 commented 4 months ago

Hello @FrancisMurtagh-arm,

I recompiled the DDK with OpenCL capabilities enabled. Now I get the following error when running the MobileNet model through ExecuteNetwork:

Error: An error occurred attempting to execute a workload: CL error: clFlush. Error code: -36 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClFullyConnectedWorkload.cpp:110]
Info: Execution time: 34389.63 ms.
terminate called after throwing an instance of 'armnn::Exception'
what(): IRuntime::EnqueueWorkload failed
[ 676.399109] mali 70000000.gpu: Failed to soft-reset GPU (timed out after 500 ms), now attempting a hard reset
[ 676.399355] mali 70000000.gpu: reloading firmware
[ 676.413358] mali 70000000.gpu: Reset complete
Aborted

Could you point me towards a resolution?

FrancisMurtagh-arm commented 4 months ago

Hi,

That seems like an issue with the OpenCL configuration rather than on the ArmNN side. Can you share the MobileNet model you are using so I can try running it to debug?

It might be worth asking in https://github.com/ARM-software/ComputeLibrary/issues

Regards, Francis.

somasundaram1702 commented 4 months ago

@FrancisMurtagh-arm:

Please find attached the MobileNet model I used (MobileNet.zip). Note that this model runs successfully on the CpuAcc runtime. I have also included the complete log below. Kindly assist.

0:02:00: Warning: DEPRECATED: The program option 'model-format' is deprecated and will be removed soon. The model-format is now automatically set.
0:02:00: Warning: No input files provided, input tensors will be filled with 0s.
0:02:00: Info: ArmNN v33.0.0
0:02:01: [ 89.391250] mali 70000000.gpu: Loading Mali firmware 0x1010000
0:02:01: [ 89.412008] mali 70000000.gpu: Mali firmware git_sha: ba6471e0f3fa3a974709abd2628da574543b3c1d
0:02:09: Info: Initialization time: 495.02 ms.
0:02:12: Info: Optimization time: 580.56 ms
0:02:12:
0:02:13: [ 91.178812] random: crng init done
0:05:41: Warning: The input data was generated, note that the output will not be useful
0:05:41: ===== Network Info =====
0:05:41: Inputs in order:
0:05:41: InputLayer, [1,3,224,224], Float32
0:05:41: Outputs in order:
0:05:41: OutPutLayer, [1,1000], Float32
0:05:41:
0:06:58: [ 230.783654] mali 70000000.gpu: AS_ACTIVE bit stuck for as 1. Might be caused by unstable GPU clk/pwr or faulty system
0:06:58: [ 230.783768] mali 70000000.gpu: Preparing to soft-reset GPU
0:06:58: [ 230.783874] mali 70000000.gpu: Wait for AS_ACTIVE bit failed for as 1, before sending MMU command 4
0:06:58: [ 230.783980] mali 70000000.gpu: Flush for GPU page table update did not complete
0:06:58: [ 230.785483] mali 70000000.gpu: Unhandled Page fault in AS1 at VA 0x00007FDFFA80A980
0:06:58: [ 230.785483] Reason: Memory is not growable
0:06:58: [ 230.785483] raw fault status: 0x230002C3
0:06:58: [ 230.785483] exception type 0xC3: TRANSLATION_FAULT at level 3
0:06:58: [ 230.785483] access type 0x2: READ
0:06:58: [ 230.785483] source id 0x2300
0:06:58: [ 230.785483] pid: 1291
0:06:58: [ 230.785704] mali 70000000.gpu: Failed to lock AS 1 for ctx 1291_0
0:08:15: [ 231.310622] mali 70000000.gpu: Stuck waiting on CLEAN_CACHES_COMPLETED bit, might be due to unstable GPU clk/pwr or possible faulty FPGA connector
0:08:15: [ 231.310759] mali 70000000.gpu: Failed to flush GPU cache when disabling AS 1 for ctx 1291_0
0:08:15: [ 231.317143] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.320093] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.324444] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.328293] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.331804] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.335644] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.339814] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.343293] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.347092] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.351120] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.354593] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.358192] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.361243] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.362342] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.366294] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.369592] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.373812] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.377661] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.380943] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.384593] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.388692] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.391753] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.395848] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.400042] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.402593] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.406943] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.410293] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.413870] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.417643] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.421817] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.425258] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.429093] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.433093] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.435662] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.437093] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.440658] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.444593] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.448443] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.451804] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.455856] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.459693] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.463104] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.466393] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.470641] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.473693] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.477768] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.481947] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.484896] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.489093] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:15: [ 231.493170] mali 70000000.gpu: Flush for GPU page table update did not complete
0:08:51: [ 263.158760] mali 70000000.gpu: [74390095] Suspend request sent on CSG slots 0x1 timed out for slots 0x1
0:08:51: [ 263.158874] mali 70000000.gpu: Timeout waiting for CSG slots to suspend before reset, slot_mask: 0x01
0:08:51: [ 263.702757] mali 70000000.gpu: Cache clean timed out. Might be caused by unstable GPU clk/pwr or faulty system
0:08:51: [ 263.702868] mali 70000000.gpu: [74479056] Timeout waiting for CACHE_CLN_INV_L2_LSC
0:08:51: [ 263.703011] mali 70000000.gpu: Quit idle for failing to prevent gpu reset.
0:10:07: [ 264.227966] mali 70000000.gpu: AS_ACTIVE bit stuck for as 0. Might be caused by unstable GPU clk/pwr or faulty system
0:10:07: [ 264.228081] mali 70000000.gpu: Flush for GPU page table update did not complete
0:10:07: [ 264.229444] mali 70000000.gpu: Flush for GPU page table update did not complete
0:10:07: [ 264.229554] mali 70000000.gpu: Evicting context 1291_0 slots: 0x01
0:10:07: [ 264.254933] mali 70000000.gpu: Resetting GPU (allowing up to 500 ms)
0:10:07: [ 264.255022] mali 70000000.gpu: Register state:
0:10:07: [ 264.255119] mali 70000000.gpu: GPU_IRQ_RAWSTAT=0x00040200 GPU_STATUS=0x00000001 MCU_STATUS=0x00000001
0:10:07: [ 264.255238] mali 70000000.gpu: JOB_IRQ_RAWSTAT=0x00000000 MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00000000
0:10:07: [ 264.255365] mali 70000000.gpu: GPU_IRQ_MASK=0x00000000 JOB_IRQ_MASK=0x00000000 MMU_IRQ_MASK=0x00000000
0:10:07: [ 264.255476] mali 70000000.gpu: PWR_OVERRIDE0=0x00000000 PWR_OVERRIDE1=0x00000000
0:10:07: [ 264.255583] mali 70000000.gpu: SHADER_CONFIG=0x00000000 L2_MMU_CONFIG=0x00000000 TILER_CONFIG=0x00000000
0:10:07: Error: An error occurred attempting to execute a workload: CL error: clFlush. Error code: -36 at function Execute [/devenv/armnn/src/backends/cl/workloads/ClFullyConnectedWorkload.cpp:110]
0:10:07: Info: Execution time: 34094.88 ms.
0:10:07: terminate called after throwing an instance of 'armnn::Exception'
0:10:07: what(): IRuntime::EnqueueWorkload failed
0:10:08: [ 264.755764] mali 70000000.gpu: Failed to soft-reset GPU (timed out after 500 ms), now attempting a hard reset
0:10:08: [ 264.756624] mali 70000000.gpu: reloading firmware
0:10:08: [ 264.835837] mali 70000000.gpu: Reset complete
0:10:10: Aborted

somasundaram1702 commented 3 months ago

@FrancisMurtagh-arm: Were you able to execute the model? I am badly stuck here.

FrancisMurtagh-arm commented 3 months ago

Hi @somasundaram1702,

I ran your model successfully on an Odroid N2. I would suggest asking for help at https://github.com/ARM-software/ComputeLibrary/issues, as this appears to be an OpenCL configuration issue rather than an ArmNN issue.

Regards, Francis.

somasundaram1702 commented 2 months ago

@FrancisMurtagh-arm: The issue is fixed. There was a hardware bug.