Closed aryansaurav closed 2 years ago
Hi @aryansaurav,
`Device` set to GPU will try to run on the hardware GPU in the best way possible given the device's capabilities. On Barracuda 2.4.0, if the device does not support compute shaders, it will fall back to the PixelShader backend (a `Type`, as opposed to a `Device`). Could you specify the graphics API you are using (Vulkan/OpenGL) as well as the version of Barracuda, please?
Thanks! Florent
Thanks @FlorentGuinier for your reply. I am using Barracuda 2.4.0 and have tried both Vulkan and GLES (3.1/3.2). Neither works on Android if I set Device to GPU: the app crashes after 4-5 frames (seen in the Profiler). However, if I set Type to ComputePrecompiled using the other overload, it works, but quite slowly!
Further, on Windows everything works (NVIDIA RTX graphics card), but it is very slow if I set Type to any of the given options. If I just use CreateWorker() without any parameters, it's much faster. I still do not understand what Device and Type do as far as GPU usage is concerned. Is setting Type to ComputePrecompiled really meant to use the GPU? It runs too slowly!
Hi @aryansaurav
`Device` is a simple, user-friendly way to remap to a `Type`; here is the documentation:
https://docs.unity3d.com/Packages/com.unity.barracuda@2.4/manual/Worker.html
There should be no difference past this remapping! For example, on 2.4.0, on both Android and Windows, `GPU` and `Auto` should be equivalent to `ComputePrecompiled`:
```csharp
internal static WorkerFactory.Type GetBestTypeForDevice(WorkerFactory.Device device)
{
    switch (device)
    {
        case WorkerFactory.Device.Auto:
        case WorkerFactory.Device.GPU:
            return WorkerFactory.Type.ComputePrecompiled;
        default:
            return WorkerFactory.Type.CSharpBurst;
    }
}
```
However, we have the newer PixelShader backend to run on the GPU when compute shaders are not available; see https://github.com/Unity-Technologies/barracuda-release/blob/b1eac6c34b12b1ef5506fb7121a29eda2997efd1/Barracuda/Runtime/Core/Backends/BarracudaBackendsFactory.cs#L31
```csharp
internal static WorkerFactory.Type ValidateType(WorkerFactory.Type type)
{
    type = ResolveAutoType(type);
    Assert.AreNotEqual(type, WorkerFactory.Type.Auto);
    if (WorkerFactory.IsType(type, WorkerFactory.Device.GPU) && !ComputeShaderSingleton.Instance.supported)
    {
        type = WorkerFactory.Type.PixelShader;
    }
    return type;
}
```
I hope this clarifies both `Device` and `Type`. However, some of the behavior you have seen does indeed seem unexpected; maybe using only `Type` to better target the problem would help find the source?
Florent
Thanks for the details @FlorentGuinier. The code makes sense, but in practice there were some strange behaviors. If you would like to reproduce the issues I mentioned, you can take this package from GitHub (by another user): https://github.com/keijiro/NNCam
More specifically, the user initializes the worker in this file (line 16): https://github.com/keijiro/NNCam/blob/main/Assets/NNCam/SegmentationFilter.cs
Then, on Windows, I changed this line to initialize the worker either without any arguments (as it is), with a Type as input, or with a Device as input. You will have to modify it somewhat like this:

```csharp
// _worker = ModelLoader.Load(_resources.model).CreateWorker();
_model = ModelLoader.Load(_resources.model);
_worker = WorkerFactory.CreateWorker(WorkerFactory.Type.CSharpBurst, _model, true);
```
Issues on Windows: if no argument is provided to CreateWorker, it runs much faster than when Type is set to ComputePrecompiled or Device is set to GPU. But according to the explanations, these should be the same.
Issues on Android: if no argument is provided to CreateWorker, it simply does not work. If Type is set to Compute, it works, but slowly. If Device is set to GPU, it crashes. But according to your explanations, these three should be essentially the same.
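For reference, here are the three configurations described above, side by side (a minimal sketch against the Barracuda 2.4.0 API; `_model` is assumed to be a `Model` already loaded via `ModelLoader.Load`, and the exact overload signatures may vary slightly between versions):

```csharp
using Unity.Barracuda;

// Per the remapping code quoted earlier, all three of these should
// resolve to the same ComputePrecompiled backend on a compute-capable GPU.

// 1. No arguments: Device.Auto is remapped to the best available Type.
IWorker workerA = _model.CreateWorker();

// 2. Explicit device: remapped via GetBestTypeForDevice().
IWorker workerB = WorkerFactory.CreateWorker(_model, WorkerFactory.Device.GPU);

// 3. Explicit type: used directly (after ValidateType()).
IWorker workerC = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, _model);
```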
Please let me know if you need any additional information in reproducing this bug, or about the hardware specifications.
Hi @aryansaurav,
Thanks for reporting the issue,
I was able to reproduce the slowdown. The issue is that verbose mode slows down execution. You should set it to `false` instead of `true`:

```csharp
_worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, _model, false);
```
I wasn't able to reproduce the Android crash on my devices. Which Android device model do you use? Could you also try the `ComputePrecompiled` backend instead of `Compute`?
Thanks @Aurimasp, I actually just figured that out too: verbose mode leads to the slowdown. Thanks a lot anyway to you both.
But now there remains only the issue with the crash on Android. It does not happen when verbose is set to true (though it's very slow); if verbose is set to false, it crashes with Type set to ComputePrecompiled. Device info: Android 11.0 on Snapdragon 690 (OnePlus Nord N10 5G). I wonder if it has to do with the synchronous execution. I am calling it from a coroutine, but I understood today that this is not multi-threaded.
I am pasting my code with the coroutine below. I would appreciate any help.
```csharp
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.Barracuda;

namespace NNCam
{
    public class my_filtered_webcam : MonoBehaviour
    {
        #region Variable declarations
        WebCamTexture _webcam;
        bool _bcameranotfound = false;
        public ResourceSet _resources;
        public RenderTexture _webcamrendertexture;
        RenderTexture _processed_texture;
        bool _bimageupdated = false;
        IWorker _worker;
        Model _model;
        const int Width = 640 + 1;
        const int Height = 352 + 1;
        #endregion

        // Start is called before the first frame update
        void Start()
        {
            WebCamDevice[] devices = WebCamTexture.devices;
            if (_webcam != null && _webcam.isPlaying)
            {
                _webcam.Stop();
            }
            if (devices != null && devices.Length != 0)
            {
                if (devices.Length > 1)
                    _webcam = new WebCamTexture(devices[1].name);
                else
                    _webcam = new WebCamTexture(devices[0].name);
                _webcam.Play();
                _webcamrendertexture = new RenderTexture(_webcam.width, _webcam.height, 1);
                Graphics.Blit(_webcam, _webcamrendertexture);
                Debug.Log("Camera height, width: " + _webcam.height + ", " + _webcam.width);
            }
            else
            {
                Debug.Log("No camera found!");
                _bcameranotfound = true;
            }
            if (_resources != null)
            {
                _model = ModelLoader.Load(_resources.model);
                _worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, _model, false);
                if (_worker != null)
                {
                    Debug.Log("NN model loaded successfully");
                    StartCoroutine(ProcessImage());
                }
                else
                    Debug.Log("Issue loading NN model in my_filtered_webcam script!");
            }
        }

        // Update is called once per frame
        void Update()
        {
            if (!_webcam.didUpdateThisFrame)
            {
                //Debug.Log("Camera frame not updated!");
                return;
            }
            else
            {
                Graphics.Blit(_webcam, _webcamrendertexture);
            }
        }

        private void OnDestroy()
        {
            Destroy(_webcam);
            Destroy(_webcamrendertexture);
            Destroy(_processed_texture);
            _worker?.Dispose();
        }

        IEnumerator ProcessImage()
        {
            // Preprocessing for BodyPix
            while (true)
            {
                using (var _imagetensor = new Tensor(_webcamrendertexture, 3))
                {
                    _worker.Execute(_imagetensor);
                }
                var output = _worker.PeekOutput("float_segments");
                yield return new WaitForCompletion(output);
                _processed_texture = output.ToRenderTexture(0, 0, 1.0f / 32, 0.5f);
                _bimageupdated = true;
            }
        }

        void OnRenderImage(RenderTexture source, RenderTexture destination)
        {
            if (_processed_texture != null) // guard against blitting a null texture; was `if (true)`
            {
                Graphics.Blit(_processed_texture, destination);
                _bimageupdated = false;
            }
        }
    }
} // Namespace NNCam
```
I was not able to reproduce the crash on older Android devices with the provided script. We don't have a Snapdragon 690 device at the moment, but we will order one. Would it be possible to check the crash logs for the time being? You can retrieve logs with the 'adb logcat' tool.
Actually, I just fixed it: it doesn't crash anymore using asynchronous execution. Thanks a lot @Aurimasp and @FlorentGuinier for your help!
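For anyone hitting the same crash: the asynchronous approach can be sketched with Barracuda's scheduled-execution API (a minimal sketch based on the manual's scheduled-execution pattern; `_worker`, `_webcamrendertexture`, and `_processed_texture` follow the script posted earlier, and the yield interval of 5 layers is an arbitrary choice):

```csharp
IEnumerator ProcessImageScheduled()
{
    while (true)
    {
        using (var input = new Tensor(_webcamrendertexture, 3))
        {
            // Schedule the network layer by layer instead of running
            // one blocking Execute() call.
            IEnumerator schedule = _worker.StartManualSchedule(input);
            int step = 0;
            while (schedule.MoveNext())
            {
                // Hand control back to the engine every few layers so a
                // long network does not stall a single frame.
                if (++step % 5 == 0)
                    yield return null;
            }
        }
        var output = _worker.PeekOutput("float_segments");
        yield return new WaitForCompletion(output);
        _processed_texture = output.ToRenderTexture(0, 0, 1.0f / 32, 0.5f);
    }
}
```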
Hello,
CreateWorker() is overloaded with several definitions. Some of them take a Device (GPU/CPU) as input, while others take a Type (ComputePrecompiled/CSharpBurst, etc.), some of which are supposed to run on the GPU.
I tried both overloads with the aim of using the GPU for computations on camera images. The one with Device set to GPU did not work on Android (it crashes every time), although it works fine on Windows. The overload with Type set to ComputePrecompiled worked on Android, following the syntax given in the Barracuda tutorials.
My question is: what's the difference between the two? If the overload with an explicit Type is the right way to use Barracuda on the GPU, is there any specific use for the CreateWorker() overload with Device set to GPU?
If the particular application is relevant: I am trying to feed live camera video into neural networks using Barracuda.
Thanks!