Closed avinnus closed 3 years ago
Hi @avinnus
A bunch of questions to start the discussion:
Have you tried PrecompiledCompute backend ->
Some layers are probably much more expensive than others. It would be worth profiling each layer individually and creating an execution plan that schedules several layers per frame when they are cheap. This would probably improve total inference time/latency (cache coherency might help there).
Hope it helps! Florent
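To make the "execution plan" idea concrete, here is a minimal sketch of one way to build it. It assumes you have already measured a per-layer cost in milliseconds on the target device; `FramePlanner`, `PlanLayersPerFrame` and the budget value are hypothetical names for illustration and are not part of Barracuda.

```csharp
using System.Collections.Generic;

public static class FramePlanner
{
    // Greedily packs consecutive layers into per-frame groups so that the summed
    // measured cost of each group stays within budgetMs. A layer that exceeds
    // the budget on its own still gets a frame to itself.
    // Returns, for each frame, how many layers to execute in that frame.
    public static List<int> PlanLayersPerFrame(IReadOnlyList<double> layerMs, double budgetMs)
    {
        var plan = new List<int>();
        int count = 0;      // layers assigned to the current frame
        double used = 0.0;  // accumulated cost of the current frame

        foreach (double ms in layerMs)
        {
            // Close the current frame if adding this layer would exceed the budget.
            if (count > 0 && used + ms > budgetMs)
            {
                plan.Add(count);
                count = 0;
                used = 0.0;
            }
            count++;
            used += ms;
        }

        if (count > 0)
            plan.Add(count);

        return plan;
    }
}
```

For measured costs of 1, 1, 1, 10, 2 and 2 ms and a 4 ms budget, this yields three frames executing 3, 1 and 2 layers. The greedy packing keeps layers in model order, which is required because each layer depends on the previous one.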
Thanks for your quick reply!
I have tried running the CNN on the GPU before, but the app then crashes immediately on launch. Do you have any idea why this is the case? Below is the full log from opening the app with ComputePrecompiled; my device has a Mali-T880 MP12 GPU:
06-14 13:41:45.196 3659 3669 D ActivityManager: setAppIconInfo(), x : 378, y : 316, width : 336, height : 416, isHomeItem : false
06-14 13:41:45.286 3659 5021 D ActivityManager: post active user change for 0 fullscreen true isHomeActivity() false
06-14 13:41:45.330 2125 2141 I Unity : MemoryManager: Using 'Dynamic Heap' Allocator.
06-14 13:41:45.346 2125 2141 I Unity : SystemInfo CPU = ARM64 FP ASIMD AES, Cores = 8, Memory = 3537mb
06-14 13:41:45.346 2125 2141 I Unity :
06-14 13:41:45.346 2125 2141 I Unity : SystemInfo ARM big.LITTLE configuration: 4 big (mask: 0xf0), 4 little (mask: 0xf)
06-14 13:41:45.346 2125 2141 I Unity :
06-14 13:41:45.347 2125 2141 I Unity : ApplicationInfo com.RWTHAR.ARWindfarm version 0.1 build 0468ba26-5011-46ff-8cf0-7fbc494d0cf4
06-14 13:41:45.347 2125 2141 I Unity :
06-14 13:41:45.347 2125 2141 I Unity : Built from '2020.3/staging' branch, Version '2020.3.4f1 (0abb6314276a)', Build type 'Release', Scripting Backend 'mono', CPU 'armeabi-v7a', Stripping 'Disabled'
06-14 13:41:45.347 2125 2141 I Unity :
06-14 13:41:45.383 3659 3699 I ActivityManager: Displayed com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity: +184ms
06-14 13:41:45.447 2125 2141 I Unity : Company Name: RWTH AR
06-14 13:41:45.447 2125 2141 I Unity : Product Name: AR Windfarm
06-14 13:41:45.459 2125 2141 D Unity : GL_EXT_debug_marker GL_ARM_rgba8 GL_ARM_mali_shader_binary GL_OES_depth24 GL_OES_depth_texture GL_OES_depth_texture_cube_map GL_OES_packed_depth_stencil GL_OES_rgb8_rgba8 GL_EXT_read_format_bgra GL_OES_compressed_paletted_texture GL_OES_compressed_ETC1_RGB8_texture GL_OES_standard_derivatives GL_OES_EGL_image GL_OES_EGL_image_external GL_OES_EGL_image_external_essl3 GL_OES_EGL_sync GL_OES_texture_npot GL_OES_vertex_half_float GL_OES_required_internalformat GL_OES_vertex_array_object GL_OES_mapbuffer GL_EXT_texture_format_BGRA8888 GL_EXT_texture_rg GL_EXT_texture_type_2_10_10_10_REV GL_OES_fbo_render_mipmap GL_OES_element_index_uint GL_EXT_shadow_samplers GL_OES_texture_compression_astc GL_KHR_texture_compression_astc_ldr GL_KHR_texture_compression_astc_hdr GL_KHR_texture_compression_astc_sliced_3d GL_KHR_debug GL_EXT_occlusion_query_boolean GL_EXT_disjoint_timer_query GL_EXT_blend_minmax GL_EXT_discard_framebuffer GL_OES_get_program_binary GL_OES_texture_3D GL_EXT_texture_storage GL_EXT_multisampled_render_
06-14 13:41:45.459 2125 2141 D Unity : to_texture GL_OES_surfaceless_context GL_OES_texture_stencil8 GL_EXT_shader_pixel_local_storage GL_ARM_shader_framebuffer_fetch GL_ARM_shader_framebuffer_fetch_depth_stencil GL_ARM_mali_program_binary GL_EXT_sRGB GL_EXT_sRGB_write_control GL_EXT_texture_sRGB_decode GL_EXT_texture_sRGB_R8 GL_EXT_texture_sRGB_RG8 GL_KHR_blend_equation_advanced GL_KHR_blend_equation_advanced_coherent GL_OES_texture_storage_multisample_2d_array GL_OES_shader_image_atomic GL_EXT_robustness GL_EXT_draw_buffers_indexed GL_OES_draw_buffers_indexed GL_EXT_texture_border_clamp GL_OES_texture_border_clamp GL_EXT_texture_cube_map_array GL_OES_texture_cube_map_array GL_OES_sample_variables GL_OES_sample_shading GL_OES_shader_multisample_interpolation GL_EXT_shader_io_blocks GL_OES_shader_io_blocks GL_EXT_tessellation_shader GL_OES_tessellation_shader GL_EXT_primitive_bounding_box GL_OES_primitive_bounding_box GL_EXT_geometry_shader GL_OES_geometry_shader GL_ANDROID_extension_pack_es31a GL_EXT_gpu_shader5 GL_OES_gpu_shader5 GL_EXT_texture
06-14 13:41:45.459 2125 2141 D Unity : _buffer GL_OES_texture_buffer GL_EXT_copy_image GL_OES_copy_image GL_EXT_shader_non_constant_global_initializers GL_EXT_color_buffer_half_float GL_EXT_color_buffer_float GL_EXT_YUV_target GL_OVR_multiview GL_OVR_multiview2 GL_OVR_multiview_multisampled_render_to_texture GL_KHR_robustness GL_KHR_robust_buffer_access_behavior GL_EXT_draw_elements_base_vertex GL_OES_draw_elements_base_vertex GL_EXT_protected_textures GL_EXT_buffer_storage
06-14 13:41:46.123 2125 2141 I Unity : XRGeneral Settings awakening...
06-14 13:41:46.895 2125 2193 W Unity :
06-14 13:41:49.298 2125 2141 I Unity : Using fake GPS location lat:50 lon:6
06-14 13:41:49.298 2125 2141 I Unity :
06-14 13:41:49.903 2125 2141 E Unity : Material 'TestMaterial' with Shader 'Custom/TranspUnlit' doesn't have a texture property '_MainTex'
06-14 13:41:49.903 2125 2141 E Unity :
06-14 13:41:51.605 2125 2141 E Unity : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.605 2125 2141 E Unity : at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.605 2125 2141 E Unity : at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.605 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.605 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.605 2125 2141 E Unity :
06-14 13:41:51.650 2125 2141 E Unity : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.650 2125 2141 E Unity : at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.650 2125 2141 E Unity : at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.650 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.650 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.650 2125 2141 E Unity :
06-14 13:41:51.673 2125 2141 E Unity : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.673 2125 2141 E Unity : at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.673 2125 2141 E Unity : at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.673 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.673 2125 2141 E Unity : at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.673 2125 2141 E Unity :
06-14 13:41:52.928 2125 2141 E Unity : -------- GLSL link error: Max number of total work group invocations exceeded.
06-14 13:41:52.928 2125 2141 E Unity :
06-14 13:41:52.928 2125 2141 E Unity :
06-14 13:41:52.928 2125 2141 E Unity :
06-14 13:41:52.928 2125 2141 E Unity : ERROR: Unable to link compute shader: Conv2dA_NHWC.Conv2DKernelKxK_T16x16_R4x4_NHWC
06-14 13:41:52.928 2125 2141 E Unity :
06-14 13:41:53.383 3659 5021 W ActivityManager: crash : com.RWTHAR.ARWindfarm,0
06-14 13:41:53.388 3659 5021 W ActivityManager: Force finishing activity com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity
06-14 13:41:53.447 3659 3685 I ActivityManager: Showing crash dialog for package com.RWTHAR.ARWindfarm u0
06-14 13:41:53.538 3659 6492 D PackageManager: getSelectedMetaData : packageName(com.RWTHAR.ARWindfarm) or Metadata strings {[Ljava.lang.String;@5511a13}
06-14 13:41:53.912 3659 3684 W ActivityManager: Activity pause timeout for ActivityRecord{a8a192a u0 com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity t8422 f}
06-14 13:41:53.913 3659 3684 D ActivityManager: isScaleDownAnimationEnabled() : true
06-14 13:41:53.913 3659 3684 D ActivityManager: clearAppIconInfo()
06-14 13:41:53.913 3659 3684 D ActivityManager: applyOptionsLocked, ANIM_CUSTOM_SCALE_DOWN
06-14 13:41:53.953 3659 4146 D ActivityManager: post active user change for 0 fullscreen true isHomeActivity() true
06-14 13:41:57.378 3659 5021 W ActivityManager: Force finishing activity com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity
06-14 13:41:57.410 3659 5021 I ActivityManager: Killing 2125:com.RWTHAR.ARWindfarm/u0a334 (adj 900): crash
06-14 13:41:57.611 3659 3699 W ActivityManager: setHasOverlayUi called on unknown pid: 2125
06-14 13:41:59.039 3659 5021 D ActivityManager: setLockScreenShown(true) is called from 4060
We are currently working on plotting the execution time for each frame, will let you know if an execution plan minimizes our issue!
Thanks for the logs. It seems this is running on OpenGLES2.0, which we don't actually support atm because of driver quality in terms of compute shaders. Have you tried Vulkan as a gfx device?
Also, the app might be killed by the OS because of high memory usage. Could you try to monitor that?
The app is running in ARCore which doesn't support Vulkan so far... Are there any other supported platforms for GPU inference on Android which you would recommend?
Taking a look at your callstack, the culprit doesn't seem to be Vulkan or OpenGL but:
06-14 13:41:52.928 2125 2141 E Unity : -------- GLSL link error: Max number of total work group invocations exceeded.
Conv2dA_NHWC.Conv2DKernelKxK_T16x16_R4x4_NHWC
Could you share the onnx model? Also which device are you testing on?
Thanks @AlexRibard for looking into our issue as well! This is our onnx model, I am testing on a Samsung Galaxy S7.
OK, sorry about the false lead. I tested your model on a OnePlus A6010 on Vulkan. It runs very slowly, but it runs (680ms for the whole model every frame). Memory usage is a bit high (200MB), so depending on what else you are doing, it might exceed the memory limit or be slow enough that the driver kills the app.
On OpenGLES-3 execution is even slower and the app crashes. To make it run decently I had to split up execution and run one layer per frame. You can split up execution as follows:
var schedule = worker.StartManualSchedule(I);
...
// Call once per frame; each MoveNext() executes the next step of the model.
bool hasMoreWork = schedule.MoveNext();
if (!hasMoreWork)
{
    var output = worker.PeekOutput();
}
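If one step per frame is too slow, the same manual schedule can be advanced several steps per frame. The sketch below relies only on the schedule being a plain `System.Collections.IEnumerator`, as in the snippet above; `ScheduleStepper` and `Step` are hypothetical helper names, not Barracuda API.

```csharp
using System.Collections;

public static class ScheduleStepper
{
    // Advances the schedule by at most maxSteps steps.
    // Returns false as soon as the enumerator reports that it is exhausted,
    // and true otherwise (meaning more work may remain for a later frame).
    public static bool Step(IEnumerator schedule, int maxSteps)
    {
        for (int i = 0; i < maxSteps; i++)
        {
            if (!schedule.MoveNext())
                return false;
        }
        return true;
    }
}
```

A per-frame update would call `Step(schedule, n)` once and read the result with `PeekOutput()` only after it returns false. The step count `n` can be fixed or taken from a per-frame plan based on measured layer times.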
Did you run the model on the GPU or on the CPU when executing with OpenGLES-3? I am asking because Barracuda doesn't support the OpenGLES-3 platform for GPU inference, right? Also, when testing the model in my AR app, the calculation of the whole model takes approx. 7s, which is much longer than the 680ms in your tests. Is it possible that the processing unit makes such a big difference? Or must there be another explanation why the execution of the model on my device is so slow?
I was running with a ComputePrecompiled worker, so GPU.
It runs on OpenGLES-3, but the driver support is very spotty. Even on my phone it was slower compared to Vulkan so it goes to show that it is not reliable.
As for why it is slow on your device: that could be because of spotty driver support, or because the hardware is less capable...
Lots of reasons.
I would build a standalone app running on Vulkan with only your model and check the perf. See if it is good or not.
If it is slower than it should be, do raise it up. (But do note that on OpenGLES-3 we won't be of help).
I suggest using a less expensive model or splitting the execution over a few frames like I mentioned.
https://github.com/Unity-Technologies/barracuda-release/blob/release/1.0.0/Documentation~/ModelExecution.md
I had already split up the execution of the model over a few frames from the beginning, now I also specified some layers that can be executed together depending on how much time they take as @FlorentGuinier suggested. This helped to speed up the execution significantly, it is nearly twice as fast now.
When testing with Vulkan on the GPU, different devices had very different execution times (from 0.4 to 21 seconds). I am not sure why this is the case, however this is mostly irrelevant now because ARCore doesn't support Vulkan anyway. On ARKit the model runs with Metal on the GPU, still taking about a second or more for the whole model. However you don't seem to have any further tips for speeding up the execution than choosing a less expensive model, do you?
In any case, thanks a lot @FlorentGuinier and @AlexRibard for the time and effort you put into answering my questions!
Could you try with a ComputeRef worker on mobile? See if that helps a bit. A lot of the time, simpler code runs faster on mobile.
If that doesn't help, I would reduce the number of channels or kernel sizes in your convolutions. Or replace them with depthwise conv.
We are working on faster kernels for mobile. They are in the works but will take a month or so to come out.
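To get a feel for how much the suggested model changes could save, the following sketch counts multiply-accumulate operations for a standard convolution and for a depthwise convolution followed by a 1x1 pointwise convolution. The formulas are the standard ones; the layer sizes in the example afterwards are made up for illustration and are not taken from DDRNet-23-Slim. `ConvCost` is a hypothetical helper name.

```csharp
public static class ConvCost
{
    // Multiply-accumulates of a standard KxK convolution that produces an
    // outH x outW x cOut output from cIn input channels.
    public static long Standard(int k, int cIn, int cOut, int outH, int outW)
    {
        return (long)k * k * cIn * cOut * outH * outW;
    }

    // Multiply-accumulates of a depthwise KxK convolution (one filter per
    // input channel) followed by a 1x1 pointwise convolution to cOut channels.
    public static long DepthwiseSeparable(int k, int cIn, int cOut, int outH, int outW)
    {
        long depthwise = (long)k * k * cIn * outH * outW;
        long pointwise = (long)cIn * cOut * outH * outW;
        return depthwise + pointwise;
    }
}
```

For a 3x3 layer with 64 input and 64 output channels on a 32x32 output, the standard form needs 37,748,736 multiply-accumulates and the depthwise-separable form 4,784,128, roughly 7.9 times fewer. Halving both channel counts cuts the standard cost by a factor of four. Either change alters the network, so it would normally require retraining and re-checking accuracy.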
Closing this issue as there is/has been no activity for some time. Please reopen if needed.
Hi,
I am implementing far-distance gameobject occlusion in an AR app. For this, I have managed to include the DDRNet-23-Slim real-time semantic segmentation network into Unity using Barracuda, having the mobile camera picture as the input into the CNN and then passing the calculated output as a Texture2D on to a shader which occludes those gameobject pixels that don't pass a certain condition.
When running the app, I am now troubled by the relatively slow calculation of the CNN output, whilst the shader renders the gameobjects nearly instantaneously. This leads to a very noticeable time lag between the mobile camera picture being taken and the actual occlusion of the gameobjects. Sometimes the app even crashes. As the DDRNet-23-Slim network is one of the fastest networks available for semantic segmentation (although only tested on high-performance GPUs, cf. https://paperswithcode.com/sota/real-time-semantic-segmentation-on-cityscapes), I am wondering if there is any way to optimize my integration of the CNN into Barracuda?
As proposed in the FAQ, I am currently executing the model on the CPU (using CSharpBurst). I am also scheduling the execution using the ExecuteAsync() method, calculating only one layer per frame, because more layers per frame generate pauses which are too long. From issue #16 I assume there still is no way to run the CNN calculation on another thread? Instead of creating a RenderTexture with Barracuda's TensorToRenderTexture(), I am currently iterating over each pixel to create a Texture2D to pass on to the shader, because I am only using a single channel from the CNN output. I am also applying the sigmoid function to each calculated output for normalisation. Could this act as a bottleneck? Is there any way to convert a tensor to a Texture2D more efficiently?
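One possible way to make the per-pixel conversion cheaper is to fill a byte buffer in a single pass and upload it to the texture once, instead of setting pixels individually. The sketch below covers only the buffer-filling part, in plain C#. It assumes the network output has been read back as a flat float array in NHWC layout with batch size 1 (an assumption about the data layout); `MaskConverter` and `SigmoidChannelToBytes` are hypothetical names.

```csharp
using System;

public static class MaskConverter
{
    // Extracts one channel from a flat NHWC float array (batch size 1),
    // applies the sigmoid function and scales the result to 0..255.
    // Element (x, y, c) is assumed to live at index (y * width + x) * channels + c.
    public static byte[] SigmoidChannelToBytes(float[] data, int width, int height, int channels, int channel)
    {
        if (data.Length != width * height * channels)
            throw new ArgumentException("data length does not match width * height * channels");
        if (channel < 0 || channel >= channels)
            throw new ArgumentOutOfRangeException(nameof(channel));

        var result = new byte[width * height];
        for (int i = 0; i < result.Length; i++)
        {
            double logit = data[i * channels + channel];
            double p = 1.0 / (1.0 + Math.Exp(-logit)); // sigmoid, in (0, 1)
            result[i] = (byte)(p * 255.0 + 0.5);        // round to nearest byte
        }
        return result;
    }
}
```

In Unity, the resulting array could then be uploaded in one call, for example with `Texture2D.LoadRawTextureData` on a single-channel (`TextureFormat.R8`) texture followed by `Apply()`. Whether this is actually the bottleneck would need to be confirmed with the profiler. Moving the sigmoid into the occlusion shader is another option that avoids the CPU pass entirely.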
Do you generally have any ideas on how to optimize my implementation? Any help would be greatly appreciated!
This is my source code: