StereoKit / StereoKit

An easy-to-use XR engine for building AR and VR applications with C# and OpenXR!
https://stereokit.net/Pages/Guides/Getting-Started.html
MIT License
965 stars 122 forks source link

Unusual performance drops for `Tex.SetColors` & `Mesh.SetVerts` #455

Closed austinbhale closed 1 year ago

austinbhale commented 2 years ago

Description

When testing my HL2 application, I started to notice my color image sprites would occasionally have "performance hiccups" in their IStepper's Step function. If this function call takes too long in the main thread, it will block all the other ISteppers from rendering. I then noticed the inconsistency in performance for their texture's SetColors method. The main variables to take into consideration for quicker updates of a texture's color data seemed to be:

  1. Which architecture (ARM vs ARM64 for HL2 applications, PC as "ground truth")
  2. NoMips vs Dynamic textures

Platform / Environment

Test Program

A simple UWP C# single-file (Program.cs) application was used for testing:

namespace SetColorsSpeedTest
{
    using StereoKit;
    using StereoKit.Framework;
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    public class ImagePlane : IStepper
    {
        private Tex tex;
        private Sprite sprite;
        private int imageWidth = 896;
        private int imageHeight = 504;
        private Pose windowPose = Matrix.T(new Vec3(0, 0, -1)).Pose;

        private Vec2 imageAspectRatio;

        private Stopwatch stopwatch = new Stopwatch();

        public ImagePlane()
        {
            this.tex = new Tex(TexType.Dynamic | TexType.ImageNomips, TexFormat.Rgba32);
            this.tex.SetColors(this.imageWidth, this.imageHeight, new Color32[this.imageWidth * this.imageHeight]);
            this.sprite = Sprite.FromTex(tex, SpriteType.Single);

            this.imageAspectRatio.x = 60.0f * U.cm;

            this.diagnosticTimes = new List<long>(numValues);
        }

        public bool Enabled => true;

        public bool Initialize()
        {
            return true;
        }

        public void Shutdown()
        {

        }

        // diagnostic states
        private static int numValues = 100;
        private int numOutliers = 5;

        private List<long> diagnosticTimes;

        public void Step()
        {
            if (this.diagnosticTimes.Count == numValues)
            {
                this.diagnosticTimes.Sort();

                // print diagnostics information
                Trace.WriteLine($"min: {this.diagnosticTimes[0]}ms, " +
                                $"max: {this.diagnosticTimes[numValues - 1]}ms, " +
                                $"median: {this.diagnosticTimes[numValues / 2 - 1]}ms");

                Trace.WriteLine($"printing top {numOutliers} longest times:");

                for (int i = numValues - numOutliers - 1; i < numValues; i++)
                {
                    long outlier = this.diagnosticTimes[i];
                    Trace.WriteLine($"#{i + 1}: {outlier}ms");
                }

                // safely exit the application
                SK.Quit();
                return;
            }

            Color32[] newColors = new Color32[this.imageWidth * this.imageHeight];

            stopwatch.Start();

            this.tex.SetColors(this.imageWidth, this.imageHeight, newColors);

            long l = stopwatch.ElapsedMilliseconds;
            stopwatch.Reset();

            UI.WindowBegin("image plane", ref this.windowPose, moveType: UIMove.FaceUser);
            UI.Image(this.sprite, this.imageAspectRatio);
            UI.WindowEnd();

            diagnosticTimes.Add(l);
        }
    }

    internal class Program
    {
        static void Main(string[] args)
        {
            // Initialize StereoKit
            SKSettings settings = new SKSettings
            {
                appName = "SetColorsSpeedTest",
                assetsFolder = "Assets",
            };
            if (!SK.Initialize(settings))
                Environment.Exit(1);

            SK.AddStepper(new ImagePlane());

            // Create assets used by the app
            Pose cubePose = new Pose(0, 0, -0.5f, Quat.Identity);
            Model cube = Model.FromMesh(
                Mesh.GenerateRoundedCube(Vec3.One * 0.1f, 0.02f),
                Default.MaterialUI);

            Matrix floorTransform = Matrix.TS(0, -1.5f, 0, new Vec3(30, 0.1f, 30));
            Material floorMaterial = new Material(Shader.FromFile("floor.hlsl"));
            floorMaterial.Transparency = Transparency.Blend;

            // Core application loop
            while (SK.Step(() =>
            {
                if (SK.System.displayType == Display.Opaque)
                    Default.MeshCube.Draw(floorMaterial, floorTransform);

                UI.Handle("Cube", ref cubePose, cube.Bounds);
                cube.Draw(cubePose.ToMatrix());
            })) ;
            SK.Shutdown();
        }
    }
}

Logs or exception details

All timing measurements were made over 100 render frames. For each variable, I performed the measurement 3 times and kept the most average-looking results.

Legend

key meaning
min Minimum time of one function call for n frames
max Maximum time of one function call for n frames
median Time of one function call for frame # n/2
outliers Top 5 longest times of the function call over n frames

Diagnostic times

NoMips Dynamic
x64 (PC)
min: 4ms, max: 22ms, median: 5ms
printing top 5 outliers:
#95: 10ms
#96: 10ms
#97: 11ms
#98: 11ms
#99: 13ms
#100: 22ms
    
min: 4ms, max: 13ms, median: 4ms
printing top 5 outliers:
#95: 7ms
#96: 7ms
#97: 8ms
#98: 8ms
#99: 9ms
#100: 13ms
    
ARM (HL2)
min: 4ms, max: 37ms, median: 6ms
printing top 5 outliers:
#95: 11ms
#96: 15ms
#97: 23ms
#98: 25ms
#99: 30ms
#100: 37ms
    
min: 3ms, max: 33ms, median: 5ms
printing top 5 outliers:
#95: 8ms
#96: 8ms
#97: 9ms
#98: 9ms
#99: 23ms
#100: 33ms
    
ARM64 (HL2)
min: 10ms, max: 51ms, median: 13ms
printing top 5 outliers:
#95: 26ms
#96: 27ms
#97: 27ms
#98: 34ms
#99: 45ms
#100: 51ms
    
min: 10ms, max: 40ms, median: 12ms
printing top 5 outliers:
#95: 20ms
#96: 20ms
#97: 22ms
#98: 27ms
#99: 37ms
#100: 40ms
    

Discussion

maluoi commented 2 years ago

I'll look into this in detail in the morning, thanks a bunch for writing this up! The ARM vs ARM64 thing is ...unexpected to say the least, the big difference being that ARM64 uses .NET Native, and ARM generally does not.

I'll also note that you can specify more than one TexType, it's a bit flag. So TexType.Dynamic | TexType.ImageNomips will get you both at the same time.

Edit: Also Single vs. Atlased shouldn't make a difference here, you should see logs in the console noting that Atlased isn't implemented, and it falls back to Single anyhow :)

austinbhale commented 2 years ago

Thanks so much for the speedy response!

I did a big facepalm after seeing the docs about Atlased not being implemented after making the table 😅 That's awesome that TexType is a bit flag!

So, I just tested the ARM application again with the TexType set to TexType.Dynamic | TexType.ImageNomips and still am receiving the following outliers:

min: 3ms, max: 46ms, median: 5ms
printing top 5 longest times:
#95: 7ms
#96: 7ms
#97: 8ms
#98: 10ms
#99: 19ms
#100: 46ms

These outliers are my main concern for creating the issue, but the ARM vs ARM64 comparison just peaks my curiosity too!

austinbhale commented 2 years ago

I also noticed calling Mesh.SetVerts without calculating the bounds looks like it should simply be copying data in:

https://github.com/StereoKit/StereoKit/blob/dce5945141c000a16fd7497243fab45bd507882b/StereoKitC/libraries/sk_gpu.h#L1006

However, it could be that the Map operation is taking a varying amount of time: https://stackoverflow.com/questions/40808759/id3d11devicecontextmap-slow-performance

Perhaps an option to specify the D3D11_MAP_FLAG_DO_NOT_WAIT flag could boost performance? <- src.

I tested out a very similar design to this sample program for the ARM architecture on HL2 and received outliers that are a factor of ~3-10 of the median time.

Mesh.SetVerts output (ARM + HL2)

MeshRenderer: 163ms
MeshRenderer: 30ms
MeshRenderer: 31ms
MeshRenderer: 27ms
MeshRenderer: 106ms
MeshRenderer: 27ms
MeshRenderer: 27ms
MeshRenderer: 28ms
MeshRenderer: 41ms
MeshRenderer: 20ms
maluoi commented 2 years ago

Yeah, I do suspect that a lot of the issue here is related to interaction with the GPU, which can be kinda complicated. Basically, if the GPU is busy doing something like drawing the previous frame, then it has to wait until the GPU unlocks before a Map can occur.

D3D11_MAP_FLAG_DO_NOT_WAIT is perhaps part of the solution, but that flag IIRC should cause the Map call to error out instead of waiting! You can then attempt to Map later on when the device isn't busy, but rescheduling that task is not exactly simple in this case. There needs to be some amount of interaction with SK's async asset pipeline here to let these slide until later in the frame, or to the next frame.