Unity-Technologies / barracuda-release


Slow execution of real-time semantic segmentation CNN #192

Closed avinnus closed 3 years ago

avinnus commented 3 years ago

Hi,

I am implementing far-distance GameObject occlusion in an AR app. For this, I have integrated the DDRNet-23-Slim real-time semantic segmentation network into Unity using Barracuda: the mobile camera picture is the input to the CNN, and the calculated output is passed as a Texture2D to a shader, which occludes those GameObject pixels that don't pass a certain condition.

When running the app, I am now troubled by the relatively slow execution of the CNN, while the shader renders the GameObjects nearly instantaneously. This leads to a very noticeable lag between the mobile camera picture being taken and the actual occlusion of the GameObjects; sometimes the app even crashes. As DDRNet-23-Slim is one of the fastest networks available for semantic segmentation (although only benchmarked on high-performance GPUs, cf. https://paperswithcode.com/sota/real-time-semantic-segmentation-on-cityscapes), I am wondering if there is any way to optimize my integration of the CNN into Barracuda.

As proposed in the FAQ, I am currently executing the model on the CPU (using CSharpBurst). I am also scheduling the execution with ExecuteAsync(), calculating only one layer per frame; more layers per frame cause pauses that are too long. From issue #16 I assume there is still no way to run the CNN calculation on another thread? Instead of creating a RenderTexture with Barracuda's TensorToRenderTexture(), I am currently iterating over each pixel to create a Texture2D to pass to the shader, because I only use a single channel of the CNN output. I also apply the sigmoid function to each calculated output for normalisation. Could this be a bottleneck? Is there a more efficient way to convert a tensor to a Texture2D?
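A possibly cheaper path, assuming Barracuda 1.x exposes a TensorToRenderTexture overload with batch and fromChannel arguments, would look like this sketch (channel 10 and the material names mirror the code below; the sigmoid normalisation would move into the shader):

```csharp
// Sketch only: write a single channel of the output tensor straight into a
// RenderTexture, instead of building a Texture2D pixel by pixel on the CPU.
// Assumes Barracuda 1.x; channel 10 is the class channel used elsewhere here.
Tensor mask = worker.PeekOutput();
var confidenceRT = new RenderTexture(128, 64, 0, RenderTextureFormat.ARGB32);
BarracudaTextureUtils.TensorToRenderTexture(mask, confidenceRT, batch: 0, fromChannel: 10);

// The materials can sample the RenderTexture directly; applying the sigmoid
// in the occlusion shader avoids the CPU round-trip entirely.
m_MaterialGray.SetTexture("_ConfMap", confidenceRT);
m_MaterialRed.SetTexture("_ConfMap", confidenceRT);
```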

Do you generally have any ideas on how to optimize my implementation? Any help would be greatly appreciated!

This is my source code:

using System;
using System.Collections;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using UnityEngine;
using UnityEngine.XR.ARFoundation;
using UnityEngine.XR.ARSubsystems;
using Unity.Barracuda;

public class CameraMask : MonoBehaviour
{
    ARCameraManager cameraManager; 
    Texture2D m_Texture;
    public NNModel modelAsset;
    private Model m_RuntimeModel;

    public Material m_MaterialRed;
    public Material m_MaterialGray;

    public Texture2D confMap;
    private Tensor input;
    private IWorker worker;
    private IEnumerator executor=null;

    void Start() 
    {
        m_RuntimeModel = ModelLoader.Load(modelAsset);
    }

    void OnEnable()
    {
        if (cameraManager == null)
            cameraManager = GetComponent<ARCameraManager>();
        if (cameraManager != null)
        {
            cameraManager.frameReceived += OnCameraFrameReceived;
        }
    }

    void OnDisable()
    {
        if (cameraManager != null)
        {
            cameraManager.frameReceived -= OnCameraFrameReceived;
        }
    }

    unsafe void createWorker() 
    {
        if (!cameraManager.TryAcquireLatestCpuImage(out XRCpuImage image))
        {
            executor=null;
            return;
        }

        var conversionParams = new XRCpuImage.ConversionParams
        {
            // Get the entire image.
            inputRect = new RectInt(0, 0, image.width, image.height),

            // Keep the full camera image size here; it is scaled down to the CNN input size further below.
            outputDimensions = new Vector2Int(image.width, image.height),

            // Choose RGBA format.
            outputFormat = TextureFormat.RGBA32,

            // Flip across the vertical axis (mirror image).
            transformation = XRCpuImage.Transformation.MirrorY
        };

        // See how many bytes you need to store the final image.
        int size = image.GetConvertedDataSize(conversionParams);

        // Allocate a buffer to store the image.
        var buffer = new NativeArray<byte>(size, Allocator.Temp);

        // Extract the image data
        image.Convert(conversionParams, new IntPtr(buffer.GetUnsafePtr()), buffer.Length);

        // The image was converted to RGBA32 format and written into the provided buffer
        // so you can dispose of the XRCpuImage. You must do this or it will leak resources.
        image.Dispose();

        // Put data into a texture to visualize it.
        m_Texture = new Texture2D(
            conversionParams.outputDimensions.x,
            conversionParams.outputDimensions.y,
            conversionParams.outputFormat,
            false);

        m_Texture.LoadRawTextureData(buffer);
        m_Texture.Apply();

        // Scale camera picture to CNN input size
        TextureScale.Bilinear(m_Texture, 1024, 512);

        // Note: creating the worker once (e.g. in Start()) and reusing it would
        // avoid re-allocating its internal buffers for every camera frame.
        worker = WorkerFactory.CreateWorker(WorkerFactory.Type.CSharpBurst, m_RuntimeModel);

        // Texture inputs.
        int channelCount = 3; // you can treat input pixels as 1 (grayscale), 3 (color) or 4 (color with alpha) channels
        input = new Tensor(m_Texture, channelCount); 

        executor=worker.ExecuteAsync(input);
        buffer.Dispose();
    }

    unsafe void OnCameraFrameReceived(ARCameraFrameEventArgs eventArgs)
    {
        if (executor == null)
        {
            createWorker();
            // createWorker() leaves executor null if no camera image could be
            // acquired; bail out to avoid a NullReferenceException below.
            if (executor == null)
                return;
        }

        bool hasWork = false;
        int maxLayersPerTick = 1; // Specify how many layers of your network to execute per frame

        do
        {  
            hasWork = executor.MoveNext();           
        } while (hasWork && --maxLayersPerTick > 0);

        if (!hasWork)
        {
            Tensor mask = worker.PeekOutput(); // Put (output name) if there are multiple outputs in model

            // Write output in Texture2D
            Color[] cols=new Color[64*128];
            var k = 0;
            for (int j=0; j<128; j++)
            {
                for(int i=0; i<64; i++)
                {
                    var tmp = Convert.ToSingle(1 / (1 + Math.Exp(-mask[0, i, j, 10])));
                    cols[k] = new Color(tmp,tmp,tmp,1);
                    k++;
                }
            }
            Texture2D confidence = new Texture2D(64, 128);
            confidence.SetPixels(cols);
            confidence.Apply();

            TextureScale.Bilinear(confidence, Screen.width, Screen.height);

            m_MaterialGray.SetTexture("_ConfMap", confidence);
            m_MaterialRed.SetTexture("_ConfMap", confidence);

            confMap = confidence;
            mask.Dispose(); 
            worker.Dispose();
            input.Dispose();

            createWorker();
        }
        else
        {
            worker.WaitForCompletion(); 
        }
    }

    Texture2D toTexture2D(RenderTexture rTex)
    {
        // Size the texture from the RenderTexture instead of hardcoding 128x64.
        Texture2D tex = new Texture2D(rTex.width, rTex.height, TextureFormat.RGBA32, false);
        // ReadPixels reads from the active RenderTexture.
        RenderTexture.active = rTex;
        tex.ReadPixels(new Rect(0, 0, rTex.width, rTex.height), 0, 0);
        tex.Apply();
        return tex;
    }
}
FlorentGuinier commented 3 years ago

Hi @avinnus

A few questions to start the discussion:

Have you tried the ComputePrecompiled backend?

Some layers are probably much more expensive than others. It would be interesting to profile each layer individually and create an execution plan that schedules a few layers per frame when they are cheap; this would probably improve total inference time and latency (cache coherency might help there).
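Such an execution plan could look roughly like this (a sketch only, assuming Barracuda 1.x's StartManualSchedule, where each MoveNext() call executes one layer; the 4 ms budget is an arbitrary placeholder):

```csharp
using System.Collections;
using System.Diagnostics;
using Unity.Barracuda;
using UnityEngine;

public class BudgetedInference : MonoBehaviour
{
    IWorker worker;

    // Run layers each frame until a per-frame time budget is spent, so cheap
    // layers get batched together and expensive layers get a frame to themselves.
    public IEnumerator ScheduleWithBudget(Tensor input, double budgetMs = 4.0)
    {
        IEnumerator schedule = worker.StartManualSchedule(input);
        var sw = new Stopwatch();
        bool hasWork = true;
        while (hasWork)
        {
            sw.Restart();
            do
            {
                hasWork = schedule.MoveNext(); // one layer per call
            } while (hasWork && sw.Elapsed.TotalMilliseconds < budgetMs);
            yield return null; // give the rest of the frame back
        }
        // worker.PeekOutput() is now valid.
    }
}
```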

Hope it helps! Florent

avinnus commented 3 years ago

Thanks for your quick reply!

I have tried running the CNN on the GPU before; however, this results in the app crashing immediately on launch. Do you have any idea why this is the case? Below is everything logged when opening the app with ComputePrecompiled; my device has a Mali-T880 MP12 GPU:

06-14 13:41:45.196  3659  3669 D ActivityManager: setAppIconInfo(), x : 378, y : 316, width : 336, height : 416, isHomeItem : false
06-14 13:41:45.286  3659  5021 D ActivityManager: post active user change for 0 fullscreen true isHomeActivity() false
06-14 13:41:45.330  2125  2141 I Unity   : MemoryManager: Using 'Dynamic Heap' Allocator.
06-14 13:41:45.346  2125  2141 I Unity   : SystemInfo CPU = ARM64 FP ASIMD AES, Cores = 8, Memory = 3537mb
06-14 13:41:45.346  2125  2141 I Unity   :
06-14 13:41:45.346  2125  2141 I Unity   : SystemInfo ARM big.LITTLE configuration: 4 big (mask: 0xf0), 4 little (mask: 0xf)
06-14 13:41:45.346  2125  2141 I Unity   :
06-14 13:41:45.347  2125  2141 I Unity   : ApplicationInfo com.RWTHAR.ARWindfarm version 0.1 build 0468ba26-5011-46ff-8cf0-7fbc494d0cf4
06-14 13:41:45.347  2125  2141 I Unity   :
06-14 13:41:45.347  2125  2141 I Unity   : Built from '2020.3/staging' branch, Version '2020.3.4f1 (0abb6314276a)', Build type 'Release', Scripting Backend 'mono', CPU 'armeabi-v7a', Stripping 'Disabled'
06-14 13:41:45.347  2125  2141 I Unity   :
06-14 13:41:45.383  3659  3699 I ActivityManager: Displayed com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity: +184ms
06-14 13:41:45.447  2125  2141 I Unity   : Company Name: RWTH AR
06-14 13:41:45.447  2125  2141 I Unity   : Product Name: AR Windfarm
06-14 13:41:45.459  2125  2141 D Unity   :  GL_EXT_debug_marker GL_ARM_rgba8 GL_ARM_mali_shader_binary GL_OES_depth24 GL_OES_depth_texture GL_OES_depth_texture_cube_map GL_OES_packed_depth_stencil GL_OES_rgb8_rgba8 GL_EXT_read_format_bgra GL_OES_compressed_paletted_texture GL_OES_compressed_ETC1_RGB8_texture GL_OES_standard_derivatives GL_OES_EGL_image GL_OES_EGL_image_external GL_OES_EGL_image_external_essl3 GL_OES_EGL_sync GL_OES_texture_npot GL_OES_vertex_half_float GL_OES_required_internalformat GL_OES_vertex_array_object GL_OES_mapbuffer GL_EXT_texture_format_BGRA8888 GL_EXT_texture_rg GL_EXT_texture_type_2_10_10_10_REV GL_OES_fbo_render_mipmap GL_OES_element_index_uint GL_EXT_shadow_samplers GL_OES_texture_compression_astc GL_KHR_texture_compression_astc_ldr GL_KHR_texture_compression_astc_hdr GL_KHR_texture_compression_astc_sliced_3d GL_KHR_debug GL_EXT_occlusion_query_boolean GL_EXT_disjoint_timer_query GL_EXT_blend_minmax GL_EXT_discard_framebuffer GL_OES_get_program_binary GL_OES_texture_3D GL_EXT_texture_storage GL_EXT_multisampled_render_
06-14 13:41:45.459  2125  2141 D Unity   : to_texture GL_OES_surfaceless_context GL_OES_texture_stencil8 GL_EXT_shader_pixel_local_storage GL_ARM_shader_framebuffer_fetch GL_ARM_shader_framebuffer_fetch_depth_stencil GL_ARM_mali_program_binary GL_EXT_sRGB GL_EXT_sRGB_write_control GL_EXT_texture_sRGB_decode GL_EXT_texture_sRGB_R8 GL_EXT_texture_sRGB_RG8 GL_KHR_blend_equation_advanced GL_KHR_blend_equation_advanced_coherent GL_OES_texture_storage_multisample_2d_array GL_OES_shader_image_atomic GL_EXT_robustness GL_EXT_draw_buffers_indexed GL_OES_draw_buffers_indexed GL_EXT_texture_border_clamp GL_OES_texture_border_clamp GL_EXT_texture_cube_map_array GL_OES_texture_cube_map_array GL_OES_sample_variables GL_OES_sample_shading GL_OES_shader_multisample_interpolation GL_EXT_shader_io_blocks GL_OES_shader_io_blocks GL_EXT_tessellation_shader GL_OES_tessellation_shader GL_EXT_primitive_bounding_box GL_OES_primitive_bounding_box GL_EXT_geometry_shader GL_OES_geometry_shader GL_ANDROID_extension_pack_es31a GL_EXT_gpu_shader5 GL_OES_gpu_shader5 GL_EXT_texture
06-14 13:41:45.459  2125  2141 D Unity   : _buffer GL_OES_texture_buffer GL_EXT_copy_image GL_OES_copy_image GL_EXT_shader_non_constant_global_initializers GL_EXT_color_buffer_half_float GL_EXT_color_buffer_float GL_EXT_YUV_target GL_OVR_multiview GL_OVR_multiview2 GL_OVR_multiview_multisampled_render_to_texture GL_KHR_robustness GL_KHR_robust_buffer_access_behavior GL_EXT_draw_elements_base_vertex GL_OES_draw_elements_base_vertex GL_EXT_protected_textures GL_EXT_buffer_storage
06-14 13:41:46.123  2125  2141 I Unity   : XRGeneral Settings awakening...
06-14 13:41:46.895  2125  2193 W Unity   :
06-14 13:41:49.298  2125  2141 I Unity   : Using fake GPS location lat:50 lon:6
06-14 13:41:49.298  2125  2141 I Unity   :
06-14 13:41:49.903  2125  2141 E Unity   : Material 'TestMaterial' with Shader 'Custom/TranspUnlit' doesn't have a texture property '_MainTex'
06-14 13:41:49.903  2125  2141 E Unity   :
06-14 13:41:51.605  2125  2141 E Unity   : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.605  2125  2141 E Unity   :   at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.605  2125  2141 E Unity   :   at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.605  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.605  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.605  2125  2141 E Unity   :
06-14 13:41:51.650  2125  2141 E Unity   : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.650  2125  2141 E Unity   :   at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.650  2125  2141 E Unity   :   at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.650  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.650  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.650  2125  2141 E Unity   :
06-14 13:41:51.673  2125  2141 E Unity   : NullReferenceException: Object reference not set to an instance of an object
06-14 13:41:51.673  2125  2141 E Unity   :   at CameraMask.OnCameraFrameReceived (UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs eventArgs) [0x00012] in <bfa63440340f423390beb30b841d3e09>:0
06-14 13:41:51.673  2125  2141 E Unity   :   at (wrapper delegate-invoke) System.Action`1[UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs].invoke_void_T(UnityEngine.XR.ARFoundation.ARCameraFrameEventArgs)
06-14 13:41:51.673  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.InvokeFrameReceivedEvent (UnityEngine.XR.ARSubsystems.XRCameraFrame frame) [0x0033c] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.673  2125  2141 E Unity   :   at UnityEngine.XR.ARFoundation.ARCameraManager.Update () [0x000b8] in <21774a90805648be8730bf415352cbc2>:0
06-14 13:41:51.673  2125  2141 E Unity   :
06-14 13:41:52.928  2125  2141 E Unity   : -------- GLSL link error:  Max number of total work group invocations exceeded.
06-14 13:41:52.928  2125  2141 E Unity   :
06-14 13:41:52.928  2125  2141 E Unity   :
06-14 13:41:52.928  2125  2141 E Unity   :
06-14 13:41:52.928  2125  2141 E Unity   : ERROR: Unable to link compute shader: Conv2dA_NHWC.Conv2DKernelKxK_T16x16_R4x4_NHWC
06-14 13:41:52.928  2125  2141 E Unity   :
06-14 13:41:53.383  3659  5021 W ActivityManager: crash : com.RWTHAR.ARWindfarm,0
06-14 13:41:53.388  3659  5021 W ActivityManager:   Force finishing activity com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity
06-14 13:41:53.447  3659  3685 I ActivityManager: Showing crash dialog for package com.RWTHAR.ARWindfarm u0
06-14 13:41:53.538  3659  6492 D PackageManager: getSelectedMetaData : packageName(com.RWTHAR.ARWindfarm) or Metadata strings {[Ljava.lang.String;@5511a13}
06-14 13:41:53.912  3659  3684 W ActivityManager: Activity pause timeout for ActivityRecord{a8a192a u0 com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity t8422 f}
06-14 13:41:53.913  3659  3684 D ActivityManager: isScaleDownAnimationEnabled() : true
06-14 13:41:53.913  3659  3684 D ActivityManager: clearAppIconInfo()
06-14 13:41:53.913  3659  3684 D ActivityManager: applyOptionsLocked, ANIM_CUSTOM_SCALE_DOWN
06-14 13:41:53.953  3659  4146 D ActivityManager: post active user change for 0 fullscreen true isHomeActivity() true
06-14 13:41:57.378  3659  5021 W ActivityManager:   Force finishing activity com.RWTHAR.ARWindfarm/com.unity3d.player.UnityPlayerActivity
06-14 13:41:57.410  3659  5021 I ActivityManager: Killing 2125:com.RWTHAR.ARWindfarm/u0a334 (adj 900): crash
06-14 13:41:57.611  3659  3699 W ActivityManager: setHasOverlayUi called on unknown pid: 2125
06-14 13:41:59.039  3659  5021 D ActivityManager: setLockScreenShown(true) is called from 4060

We are currently working on plotting the execution time for each frame and will let you know if an execution plan mitigates the issue!

FlorentGuinier commented 3 years ago

Thanks for the logs. It seems this is running on OpenGL ES 2.0, which we don't actually support at the moment because of driver quality in terms of compute shaders. Have you tried Vulkan as the graphics device?

Also, the app might be killed by the OS because of high memory usage; could you try to monitor that?

avinnus commented 3 years ago

The app is running on ARCore, which doesn't support Vulkan so far... Are there any other supported graphics APIs for GPU inference on Android that you would recommend?

AlexRibard commented 3 years ago

Taking a look at your callstack, the culprit doesn't seem to be Vulkan or OpenGL but rather:

06-14 13:41:52.928  2125  2141 E Unity   : -------- GLSL link error:  Max number of total work group invocations exceeded.
Conv2dA_NHWC.Conv2DKernelKxK_T16x16_R4x4_NHWC

Could you share the onnx model? Also which device are you testing on?

avinnus commented 3 years ago

Thanks @AlexRibard for looking into our issue as well! This is our onnx model; I am testing on a Samsung Galaxy S7.

AlexRibard commented 3 years ago

Ok, sorry about the false lead. I tested your model on a OnePlus A6010 with Vulkan: it runs very slowly, but it runs (680 ms for the whole model every frame). Memory usage is a bit high (200 MB), so depending on what else you are doing, the app might exceed the memory budget or be slow enough that the driver kills it.

On OpenGL ES 3, execution is even slower and the app crashes. To make it run decently I had to split up the execution and run one layer per frame. You can split up execution as follows:

var schedule = worker.StartManualSchedule(input);
...
bool hasMoreWork = schedule.MoveNext();

if (!hasMoreWork)
{
    var output = worker.PeekOutput();
}
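Driven from a coroutine, spreading one layer per frame, that could look like this (sketch, assuming Barracuda 1.x):

```csharp
IEnumerator RunSplit(IWorker worker, Tensor input)
{
    IEnumerator schedule = worker.StartManualSchedule(input);
    while (schedule.MoveNext())
        yield return null; // one layer per frame
    Tensor output = worker.PeekOutput(); // worker retains ownership
    // ... use output here
}
```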
avinnus commented 3 years ago

Did you run the model on the GPU or the CPU when executing with OpenGL ES 3? I am asking because Barracuda doesn't support the OpenGL ES 3 platform for GPU inference, right? When testing the model in my AR app, the calculation of the whole model takes approx. 7 s, which is much longer than the 680 ms in your tests. Can the processing unit make such a big difference, or must there be another explanation for why execution on my device is so slow?

AlexRibard commented 3 years ago

I was running with the ComputePrecompiled worker, so GPU. It does run on OpenGL ES 3, but driver support is very spotty; even on my phone it was slower than Vulkan, which goes to show it is not reliable. Why it is slow on your device could be down to spotty driver support, weaker hardware... lots of reasons. I would build a single app running on Vulkan with only your model and check the performance. If it is slower than it should be, do raise it. (But note that on OpenGL ES 3 we won't be of help.) I suggest using a less expensive model or splitting the execution over a few frames as I mentioned: https://github.com/Unity-Technologies/barracuda-release/blob/release/1.0.0/Documentation~/ModelExecution.md

avinnus commented 3 years ago

I had already split up the execution of the model over a few frames from the beginning; now I have also grouped layers that can be executed together depending on how much time they take, as @FlorentGuinier suggested. This sped up the execution significantly: it is nearly twice as fast now.

When testing with Vulkan on the GPU, different devices had very different execution times (from 0.4 to 21 seconds). I am not sure why, but this is mostly irrelevant now because ARCore doesn't support Vulkan anyway. On ARKit the model runs with Metal on the GPU, still taking about a second or more for the whole model. I take it you don't have any further tips for speeding up the execution beyond choosing a less expensive model?

In any case, thanks a lot @FlorentGuinier and @AlexRibard for the time and effort you put into answering my questions!

AlexRibard commented 3 years ago

Could you try a ComputeRef worker on mobile and see if that helps a bit? A lot of the time, simpler code runs faster on mobile. If that doesn't help, I would reduce the number of channels or the kernel sizes in your convolutions, or replace them with depthwise convolutions. We are working on faster kernels for mobile; they are in the works but will take a month or so to come out.
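For reference, swapping the backend is a one-line change (ComputeRef is Barracuda's simple, unoptimized reference compute path):

```csharp
// Same model, different backend; everything else stays as-is.
worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputeRef, m_RuntimeModel);
```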

amirebrahimi commented 3 years ago

Closing this issue as there is/has been no activity for some time. Please reopen if needed.