Unity-Technologies / com.unity.webrtc

WebRTC package for Unity

Fix: Vulkan framerate with vsync enabled #975

Closed aet closed 7 months ago

aet commented 9 months ago

Small follow-up to #966 @karasusan @BrianHarrisonUnity

If the goal is to depend on Unity's Vulkan graphics pipeline and state as much as possible, then using custom command buffers and fences is not an option. :)

But the Unity plugin interface doesn't expose enough information to operate with fences directly (unlike the DX12 interface); instead it uses frame counters. A safe frame counter is typically updated by the end of the frame, and nothing triggers IUnityGraphicsVulkan::CommandRecordingState updates for you automatically if you try to do multi-threaded processing that depends on synchronization.

It seems the approach in #966 should be mostly good enough, but there are some platforms, like LinuxStandalone and Android, where the framerate drops noticeably compared to the previous approach when vsync is enabled.

After some debugging, it seems the earliest point at which the Unity graphics device has updated the safe frame counter is after PostLateUpdate.PresentAfterDraw, so that is when we can synchronize the CommandRecordingState to the plugin-side graphics device. There doesn't seem to be a user-friendly way to ensure that a custom synchronization to the rendering plugin happens without adding a custom pass through the PlayerLoop.
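Roughly, such a pass could be registered from C# like this (a minimal sketch, not the exact code in this PR; WebRTCVulkanSyncPass and SyncCommandRecordingState are placeholders for whatever native call refreshes the plugin-side state):

using System.Collections.Generic;
using UnityEngine.LowLevel;
using UnityEngine.PlayerLoop;

public static class WebRTCVulkanSyncPass
{
    // Hypothetical native hook: stands in for whatever call updates the
    // plugin-side CommandRecordingState after the safe frame counter advances.
    static void SyncCommandRecordingState() { /* NativeMethods.SyncRecordingState(); */ }

    public static void Install()
    {
        var loop = PlayerLoop.GetCurrentPlayerLoop();
        for (var i = 0; i < loop.subSystemList.Length; i++)
        {
            if (loop.subSystemList[i].type != typeof(PostLateUpdate))
                continue;

            var subSystems = new List<PlayerLoopSystem>(loop.subSystemList[i].subSystemList);
            var presentIndex = subSystems.FindIndex(s => s.type == typeof(PostLateUpdate.PresentAfterDraw));
            if (presentIndex < 0)
                break;

            // Run right after PresentAfterDraw, the earliest point where the
            // safe frame counter has been updated.
            subSystems.Insert(presentIndex + 1, new PlayerLoopSystem
            {
                type = typeof(WebRTCVulkanSyncPass),
                updateDelegate = SyncCommandRecordingState
            });

            loop.subSystemList[i].subSystemList = subSystems.ToArray();
            break;
        }
        PlayerLoop.SetPlayerLoop(loop);
    }
}

Install() would be called once at startup, for example from a [RuntimeInitializeOnLoadMethod].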

Tested LinuxStandalone with vsync on/off on my custom branch: it runs a steady 1080p60 with both options, whereas previously with vsync on the framerate was 20 fps.

My assumption is that this will also help Android a lot and likely avoids the need for the I420Buffer::Create workaround in GpuMemoryBufferFromUnity::ToI420 that was added due to timeouts.

BrianHarrisonUnity commented 9 months ago

I'll need to test the changes when I have a moment, but I think this makes sense. It seems like this is as close as we can get to the fence version. There is still going to be a bigger delay than with the fence version, as I believe the frame number won't get incremented until the frame sync has finished, but I think this is likely the best we can do.

The other option we could explore is going back to the custom command buffer and then trying to improve how the plugin events work. I think the issue that was causing problems with the previous fix is that the custom command buffer version needed one event with the resources bound to set up the command buffer for the copy, and then a separate event with exclusive graphics queue access to submit that buffer. For the devices that had issues, I think the second event ended up getting scheduled later, and that was causing the stream issues. I would have to experiment with it, but I think there is likely a way to get this down to a single event with exclusive graphics queue access, and that would allow us to use a fence again.

If this fix works for you @karasusan, then I think we are good, but if you continue to hit issues, let me know and I'll see if I can get a version with a single plugin event to test. Unfortunately, I have been unable to replicate any of the stream issues on my devices!

aet commented 9 months ago

@karasusan I forgot to update a method name; the C# tests should work again.

karasusan commented 9 months ago

@aet I have tested this PR on Ubuntu 20.04 and Android (Galaxy Tab S4). Unfortunately, the vsync issue has not been resolved. We need to continue investigating the issue, but doing so would extend the release schedule.

To prevent schedule delays, we will revert fix #966, which is the cause of the current problem. As for the other PR, #967, I have confirmed improved performance and will merge it.

I apologize for my inability to resolve the issue.

aet commented 9 months ago

@karasusan that's ok. Meanwhile, I suggest adding a build-time #ifdef to choose between the Unity CommandRecordingState path and the custom command buffer/fence path, for easier comparison and debugging.

aet commented 9 months ago

It's possible that vsync behaves differently in my testing because I'm testing against a branch with modified video frame pooling and scheduling.

karasusan commented 9 months ago

@aet Thank you for understanding. I'm making a PR.

karasusan commented 9 months ago

@aet Oh, have you already started modifying the PR to allow switching between the two approaches? https://github.com/Unity-Technologies/com.unity.webrtc/pull/967#issuecomment-1722371437

I'm checking it.

doctorpangloss commented 9 months ago

I went down this road in April; you don't need to interact with the fences at all if you don't want to.

The other option we could explore is going back to the custom command buffer and then trying to improve how the plugin events work. I think the issue that was causing problems with the previous fix is that the custom command buffer version needed one event with the resources bound to set up the command buffer for the copy, and then a separate event with exclusive graphics queue access to submit that buffer. For the devices that had issues, I think the second event ended up getting scheduled later, and that was causing the stream issues.

What does this mean?

exclusive graphics queue access to submit that buffer... For the devices that had issues

When using render pipelines:

RenderPipelineManager.endCameraRendering += OnRenderPipelineManagerOnEndCameraRendering;

private void OnRenderPipelineManagerOnEndCameraRendering(ScriptableRenderContext context, Camera camera1)
{
  // todo: make sure the render texture target on camera1 is created
  var cmd = CommandBufferPool.Get("WebRTC Blit");
  m_source.Encode(this, cmd);
  context.ExecuteCommandBuffer(cmd);
  context.Submit();
  CommandBufferPool.Release(cmd);
}

// Extension method: needs to be static and declared in a static class.
public static void Encode(this VideoTrackSource source, VideoStreamTrack videoStreamTrack, CommandBuffer cmd)
{
    if (source.destTexture_ == null || source.destTexture_.GetNativeTexturePtr() == IntPtr.Zero)
    {
        return;
    }
    if (source.needFlip_)
    {
        cmd.Blit(source.sourceTexture_, source.destTexture_, s_scale, s_offset);
    }
    else
    {
        cmd.Blit(source.sourceTexture_, source.destTexture_);
    }
    videoStreamTrack.UpdateVideoStreamStructs(source.destTexture_);
    var trackCount = 1;
    // I had to open up a number of the package's internals here
    WebRTC.Context.batch.ResizeCapacity(trackCount);
    WebRTC.Context.batch.data.tracks[0] = videoStreamTrack.DataPtr;
    WebRTC.Context.batch.data.tracksCount = trackCount;
    Marshal.StructureToPtr(WebRTC.Context.batch.data, WebRTC.Context.batch.ptr, false);
    cmd.IssuePluginEventAndData(WebRTC.Context.batchUpdateFunction1, WebRTC.Context.batchUpdateEventID1,
        WebRTC.Context.batch.ptr);
}

works flawlessly at eliminating the latency, with proper queued frame pipelining at this stage, on the platforms I test, with no changes to the C++ code, on DirectX 11 and 12. Why wouldn't this work for Vulkan?

Also, the scheduling in the PlayerLoop is really complicated. Skinned animations will be glitchy if you do this in PostLateUpdate. Lots of bugs lurking down that way. Ask me how I know.

aet commented 9 months ago

For a long time my branch didn't use fences (or GPU flushes, which are a worse option for overall performance) at all on any target; I sacrificed a bit more memory usage and latency by pooling video frames before they were actually submitted to the encoder. I still do something similar, but the upstream fence code is included. The added latency is not that bad, but I can understand that the upstream version likely wants to keep latency as low as possible, so fences it is. Different use cases and requirements.

The issue is very low level and specific to how Unity's internal Vulkan GfxDevice is implemented and what functionality is exposed through the NativeRenderingPlugin interfaces. Things might have changed in recent years since I stopped working at Unity, but it at least used to be the case that most modern GfxDevice implementations used a single native command buffer per frame, more or less. There are usually challenges in ensuring that the expected legacy built-in RP behavior keeps working versus what SRP wants to move toward; it would usually be better to submit some work in smaller batches, but too many fine-grained command buffers aren't always ideal either.

With Vulkan, the frame counter depends on VkFence internally, but there's a lot of synchronization overhead and latency for a third-party plugin if you can't query the state natively in the plugin, on a non-rendering thread for example. Ultimately the best performance fix would be to export a new rendering interface (IUnityGraphicsVulkanV3) to plugins and expose access to the native fences for those who want to depend on them, just as is possible with DX12. Using a custom command buffer with custom fences to submit some work is a workaround.

I think a plugin using kUnityVulkanEventConfigFlag_FlushCommandBuffers would force submitting a command buffer, but it's kind of pointless for this use case if WebRTC aims to send single/multiple video tracks once per end of frame: it would submit the existing command buffer before the plugin adds additional work.

In the plugin, DX11 uses custom but native per-texture fences to synchronize between the texture blit and the subsequent encoding work.

DX12 uses a per-frame fence, a native handle provided by the rendering plugin interface. This is the closest equivalent for Vulkan.

Vulkan hides the native fence interface, so relying on Unity's safe frame counters might add latency, because you'll need to trigger an update for that state somehow, somewhere. This could be improved by exposing the native fences or by adding an additional callback from the graphics device. Or keep putting a bit of extra effort into the plugin by creating a custom command buffer/fence per texture and submitting those; that is likely the least awkward solution if no other problems occur. 🥲

Or you could accept some latency by submitting the video frame on the next frame, after the texture was initially copied.
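As a rough sketch of that one-frame-delay idea (DelayedFrameSubmitter and SubmitToEncoder are placeholders; in the real plugin this hand-off happens natively):

using UnityEngine;

// Double-buffer the copy target and submit the texture copied during the
// previous frame, so the GPU copy has certainly completed by the time the
// encoder reads it, at the cost of one frame of extra latency.
class DelayedFrameSubmitter
{
    readonly RenderTexture[] m_copies = new RenderTexture[2];
    int m_index;
    bool m_hasPreviousFrame;

    public DelayedFrameSubmitter(int width, int height)
    {
        for (var i = 0; i < m_copies.Length; i++)
            m_copies[i] = new RenderTexture(width, height, 0, RenderTextureFormat.BGRA32);
    }

    public void OnFrameRendered(Texture source)
    {
        // Hand last frame's copy to the encoder first...
        if (m_hasPreviousFrame)
            SubmitToEncoder(m_copies[1 - m_index]);

        // ...then copy the current frame into the other buffer.
        Graphics.Blit(source, m_copies[m_index]);
        m_index = 1 - m_index;
        m_hasPreviousFrame = true;
    }

    static void SubmitToEncoder(RenderTexture frame)
    {
        // Placeholder: the real plugin would queue `frame` for the native encoder here.
    }
}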

The higher-level code in the rendering plugin does WaitSync() with a hardcoded 30 ms requirement when processing work outside Unity's rendering thread, somewhere in the webrtc internals. There's not much you can do at that point anymore, but having access to native rendering handles is definitely better when dealing with thin third-party layers that need extra synchronization. If you're rendering at a steady 60 fps, the 30 ms WaitSync is likely fine and won't result in dropped frames, but if you're submitting work at a slower framerate, there's a chance more frames get dropped because the safe frame counter status takes longer to update.

I think the C# code you're using is fine if you don't want to submit single/multiple video frames at once through WebRTC.Update(), but it shouldn't change much when the work actually gets submitted to the GPU at the native rendering level versus using Unity CommandBuffers on the C# side?
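For contrast, the conventional per-frame path looks roughly like this (a sketch only; the exact API surface around CaptureStreamTrack and the WebRTC.Update() coroutine varies between package versions):

using Unity.WebRTC;
using UnityEngine;

class ConventionalCapture : MonoBehaviour
{
    [SerializeField] Camera captureCamera;
    VideoStreamTrack m_track;

    void Start()
    {
        // Capture the camera into a video track...
        m_track = captureCamera.CaptureStreamTrack(1280, 720);
        // ...and let the package submit frames once per frame at end of frame,
        // instead of per camera render as in the endCameraRendering approach.
        StartCoroutine(WebRTC.Update());
    }

    void OnDestroy()
    {
        m_track?.Dispose();
    }
}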

doctorpangloss commented 9 months ago

I think the C# code you're using is fine if you don't want to submit single/multiple video frames at once through WebRTC.Update(), but it shouldn't change much when the work actually gets submitted to the GPU at the native rendering level versus using Unity CommandBuffers on the C# side?

It's really hard for me to comprehend; you would know better than I do.

Based on profiling, the greatest source of C++-related delay was the pull-based encoding introduced by the "frame scheduler," which was removed.

Next, the biggest remaining source of latency is that the current com.unity.webrtc code executes the copy-the-render-texture-to-CUDA-memory-and-synchronously-encode step (hereafter, the "encode") at essentially a random time in the middle of the CPU stage of the render pipeline. It doesn't wait for the frame to finish rendering; it encodes whatever is in the render texture, which is almost always the contents of the previous frame. This is what my fix addresses: it causes the encode to start right when the GPU stage finishes the final blit to the render texture. This is actually the same approach as the Unity Video Recorder package.

Finally, the fences stuff you are talking about... I understand these issues less well than you do. My vague understanding is that you can achieve the same "do the encode just after the blit occurs" behavior using fences. My read is that you are doing this because you observed that the encode was occurring at a random time in the middle of the CPU stage, like I did. The only place where I really understand that fences can provide better performance is in coordinating the DirectX 12 encoder engine in NVEnc: this would allow the C# API to simply declare a fence, and once it is passed, NVEnc knows the final blit occurred and will do the blit into its video texture pool, encode, and signal another fence for you - presumably the fence you are waiting on to send the encoded data over the network via WebRTC.

My interpretation of the profiler is that there is at most 1 ms of latency to gain from implementing this, which doesn't sound like much. My limited understanding is that using the fence approach NVEnc documents greatly improves CPU occupancy, because the webrtc plugin can then use the async NVEnc encode API, which my use case is very sensitive to, and not block a render thread.

aet commented 9 months ago

:) No worries, there are plenty of NvEncoder and NvDecoder specifics left to optimize. I used to have a custom encoder back in the M85-M92 releases but didn't try porting it to the latest architecture.

One thing to keep in mind about the nvenc async API is that, for some reason, it's not supported on Linux. In its current form, where the nvenc work doesn't happen directly on the render thread blocking things, it's not as bad, but I had some optimizations against the non-async version to improve Linux performance: some pooling and offloading of jobs off the render thread.

There are some trivial changes to optimize and some bigger ones. It does seem like, out of the box, NVIDIA's sample NvEncoder causes more blocking than expected with nvEncLockBitstream. Sometimes it just feels more straightforward to write a custom version instead of modifying the SDK sample. But I try not to drift too far away from upstream and to contribute some changes when possible.