Unity-Technologies / com.unity.webrtc

WebRTC package for Unity

fix: Use IUnityGraphicsVulkan::CommandRecordingState for commandbuffer #888

Closed. karasusan closed this 1 year ago.

BrianHarrisonUnity commented 1 year ago

I tried changing the event configuration to the following in UnityRenderEvent.cpp:

        // Encode event: run outside any render pass, allow direct graphics
        // queue access, and flush command buffers for synchronization.
        UnityVulkanPluginEventConfig encodeEventConfig;
        encodeEventConfig.graphicsQueueAccess = kUnityVulkanGraphicsQueueAccess_Allow;
        encodeEventConfig.renderPassPrecondition = kUnityVulkanRenderPass_EnsureOutside;
        encodeEventConfig.flags = kUnityVulkanEventConfigFlag_EnsurePreviousFrameSubmission |
                                  kUnityVulkanEventConfigFlag_ModifiesCommandBuffersState |
                                  kUnityVulkanEventConfigFlag_FlushCommandBuffers;

        // Release-buffer event: same configuration as the encode event.
        UnityVulkanPluginEventConfig releaseBufferEventConfig;
        releaseBufferEventConfig.graphicsQueueAccess = kUnityVulkanGraphicsQueueAccess_Allow;
        releaseBufferEventConfig.renderPassPrecondition = kUnityVulkanRenderPass_EnsureOutside;
        releaseBufferEventConfig.flags = kUnityVulkanEventConfigFlag_EnsurePreviousFrameSubmission |
                                         kUnityVulkanEventConfigFlag_ModifiesCommandBuffersState |
                                         kUnityVulkanEventConfigFlag_FlushCommandBuffers;

With only this change, I believe everything works as it did before.

Even if we need additional fixes beyond this change, I believe we will need to update the event configuration to set graphicsQueueAccess to kUnityVulkanGraphicsQueueAccess_Allow if we go this route.
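
For reference, a minimal sketch of how such configurations are registered with the Unity plugin interface (the event IDs and the s_Vulkan pointer are hypothetical names; this assumes the standard IUnityGraphicsVulkan::ConfigureEvent API rather than the exact code in this PR):

    #include "IUnityGraphicsVulkan.h"

    // Hypothetical event IDs; the real plugin defines its own.
    enum { kEventEncode = 1, kEventReleaseBuffer = 2 };

    static IUnityGraphicsVulkan* s_Vulkan = nullptr; // obtained in UnityPluginLoad

    static void RegisterVulkanEvents()
    {
        UnityVulkanPluginEventConfig config = {};
        config.graphicsQueueAccess = kUnityVulkanGraphicsQueueAccess_Allow;
        config.renderPassPrecondition = kUnityVulkanRenderPass_EnsureOutside;
        config.flags = kUnityVulkanEventConfigFlag_EnsurePreviousFrameSubmission |
                       kUnityVulkanEventConfigFlag_ModifiesCommandBuffersState |
                       kUnityVulkanEventConfigFlag_FlushCommandBuffers;

        // ConfigureEvent tells Unity how to schedule each plugin event
        // issued from the C# side (e.g. via CommandBuffer.IssuePluginEvent).
        s_Vulkan->ConfigureEvent(kEventEncode, &config);
        s_Vulkan->ConfigureEvent(kEventReleaseBuffer, &config);
    }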

karasusan commented 1 year ago

I checked your suggestion and it fixed the issue. I opened another PR here: https://github.com/Unity-Technologies/com.unity.webrtc/pull/889

There are two options for solving this issue. Which do you prefer, @BrianHarrisonUnity? I think https://github.com/Unity-Technologies/com.unity.webrtc/pull/889 has less chance of degradation. On the other hand, #888 reduces the number of vkQueueSubmit calls, so it might perform better.

aet commented 1 year ago

@karasusan I recently started to look at the current webrtc encoding/gfx integration, but I had a branch against the old integration where Vulkan was heavily optimized for resource handling and Unity state integration. I'd very much prefer to go with the CommandRecordingState integration (along with a few other optimizations eventually). As of now, this branch has a performance issue in the way VulkanGraphicsDevice::WaitSync is handled: my stream goes from 60fps to 10-15fps.

karasusan commented 1 year ago

@aet Thanks for your feedback.

As you said, I also prefer using CommandRecordingState for performance reasons.

The VulkanGraphicsDevice::WaitSync implementation in this PR sleeps while waiting for the texture copy to finish. Do you think that causes performance issues? I tested this PR on Windows 10 with NvCodec and haven't found any impact on the framerate.
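
For illustration, a rough sketch of the kind of sleep-based wait under discussion (simplified and hypothetical, not the exact PR code), polling the frame counters exposed by IUnityGraphicsVulkan::CommandRecordingState:

    #include <chrono>
    #include <thread>
    #include "IUnityGraphicsVulkan.h"

    // Hypothetical helper: block until Unity reports that the frame which
    // recorded our texture copy has completed on the GPU.
    static bool WaitForFrame(IUnityGraphicsVulkan* vulkan, unsigned long long targetFrame)
    {
        for (int i = 0; i < 1000; ++i) // bounded wait instead of spinning forever
        {
            UnityVulkanRecordingState state;
            // Note: the queueAccess mode matters off the render thread;
            // see the discussion later in this thread.
            if (vulkan->CommandRecordingState(&state, kUnityVulkanGraphicsQueueAccess_DontCare) &&
                state.safeFrameNumber >= targetFrame)
            {
                return true; // resources used in targetFrame are now safe to read
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // the sleep in question
        }
        return false;
    }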

aet commented 1 year ago

@karasusan Huh, interesting. I'm usually testing on Linux. As for NvCodec drivers, there's a small difference between win32 and other platforms; Linux doesn't support enableEncodeAsync, for example. I'll give it a test on PC some day, but I'll also keep looking for improvements in how to handle things.

karasusan commented 1 year ago

@aet OK, I am going to check on Linux

karasusan commented 1 year ago

@aet You were right! I reproduced the issue on my Ubuntu 20.04 machine, but I am not sure why it occurs on Linux...

BrianHarrisonUnity commented 1 year ago

I think the original issue is that we need a way to synchronize the encoding event with the Graphics.ConvertTexture and Graphics.Blit calls happening on the C# side. Exclusive queue access was needed since we directly access the graphics queue, and adding the command buffer flush was a way to achieve the synchronization, but it can be terrible for performance and seems to cause issues on Linux.

Swapping to UnityCommandRecordingState would remove the need for exclusive queue access and would be much better for performance, but I think we need to find a new way to synchronize. I believe waiting on the frameNumber is why performance has degraded: we are now waiting on the frame to increment, which includes extra work that isn't needed in our case. That being said, I am unsure how else we could synchronize without a fence or a full flush; a fence-based approach is sketched below.
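
For illustration, synchronizing with a fence in plain Vulkan would look roughly like the following minimal sketch. It assumes the plugin controls vkQueueSubmit itself, which is not the case when Unity schedules the command buffer; all names are illustrative.

    #include <cstdint>
    #include <vulkan/vulkan.h>

    // Submit one command buffer with a fence and block until the GPU finishes it,
    // instead of flushing and waiting on the whole queue.
    static VkResult SubmitAndWait(VkDevice device, VkQueue queue, VkCommandBuffer cmd)
    {
        VkFenceCreateInfo fenceInfo = {};
        fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

        VkFence fence = VK_NULL_HANDLE;
        VkResult result = vkCreateFence(device, &fenceInfo, nullptr, &fence);
        if (result != VK_SUCCESS)
            return result;

        VkSubmitInfo submitInfo = {};
        submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.commandBufferCount = 1;
        submitInfo.pCommandBuffers = &cmd;

        result = vkQueueSubmit(queue, 1, &submitInfo, fence);
        if (result == VK_SUCCESS)
            result = vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

        vkDestroyFence(device, fence, nullptr);
        return result;
    }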

karasusan commented 1 year ago

I found that this fix has an issue where the video stream goes under 10fps on Linux. The issue occurs on Unity 2022.2.0f1 or later; I tested with Unity 2022.1.24f1 and the issue wasn't reproduced.

aet commented 1 year ago

Ok, I'm still using 2021.2.15f1 :) Trying to finish some further debugging tomorrow, but it's of ongoing interest to me regardless. I can also give 2022.1.24f1 a try.

karasusan commented 1 year ago

https://user-images.githubusercontent.com/1132081/228466730-6029bc32-98be-4718-9bb5-ecf5b45619f1.mp4

karasusan commented 1 year ago

I found a workaround for the video streaming performance issue.

We need to set VSync Count to Don't Sync in the Quality Settings.

aet commented 1 year ago

@karasusan Calling CommandRecordingState outside of the render thread is what it is; it's likely to fail the majority of the time. For simple frame-tracking purposes one might as well query and track the latest frame info within the render thread; I'm doing it at the resource-copy level in the Vulkan graphics device for now (see the sketch below). Helpful, but there are plenty of other areas to optimize that I'm trying to stabilize.
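
A minimal sketch of that idea (a hypothetical helper, not aet's actual code): snapshot the frame counters whenever the render thread runs plugin work, and let worker threads read the cached values instead of calling CommandRecordingState themselves.

    #include <atomic>
    #include "IUnityGraphicsVulkan.h"

    // Hypothetical cache, written only from the render thread.
    static std::atomic<unsigned long long> s_CurrentFrame{ 0 };
    static std::atomic<unsigned long long> s_SafeFrame{ 0 };

    // Call from render-thread plugin events (e.g. during the resource copy),
    // where CommandRecordingState is expected to succeed.
    static void UpdateFrameTracking(IUnityGraphicsVulkan* vulkan)
    {
        UnityVulkanRecordingState state;
        if (vulkan->CommandRecordingState(&state, kUnityVulkanGraphicsQueueAccess_DontCare))
        {
            s_CurrentFrame.store(state.currentFrameNumber, std::memory_order_release);
            s_SafeFrame.store(state.safeFrameNumber, std::memory_order_release);
        }
    }

    // Worker threads poll the cached value instead of touching the Unity API.
    static bool IsFrameSafe(unsigned long long frame)
    {
        return s_SafeFrame.load(std::memory_order_acquire) >= frame;
    }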

While trying to get into the mindset of the latest codebase, I've been experimenting with some of the past optimizations, but some of them are not easy to do quickly or without applying changes all over the place. For Vulkan specifics, right now my biggest obstacle is that I can't run the player with -force-vulkan-layers without crashing, compared to builds from a year ago. :) It happens on all Unity versions and doesn't happen without the plugin.

karasusan commented 1 year ago

@aet Thanks for the advice! I confirmed that calling CommandRecordingState fails on worker threads. I also found that the call doesn't fail when kUnityVulkanGraphicsQueueAccess_Allow is passed as the queueAccess argument. I will check later whether this change fixes the issue, along the lines of the sketch below.
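
A minimal sketch of that check (only the documented CommandRecordingState signature is assumed; the surrounding helper is hypothetical):

    #include "IUnityGraphicsVulkan.h"

    // Hypothetical worker-thread probe: with kUnityVulkanGraphicsQueueAccess_DontCare
    // the call tends to fail off the render thread, while _Allow returns the
    // task-executor command buffer mentioned later in this thread.
    static VkCommandBuffer TryGetCommandBuffer(IUnityGraphicsVulkan* vulkan)
    {
        UnityVulkanRecordingState state;
        if (vulkan->CommandRecordingState(&state, kUnityVulkanGraphicsQueueAccess_Allow))
            return state.commandBuffer;
        return VK_NULL_HANDLE;
    }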

I haven't tried -force-vulkan-layers with the WebRTC package. How are you using this command-line parameter?

aet commented 1 year ago

Just exporting a Linux player and running it with an extra CLI argument. It doesn't have to be a development player. On Ubuntu it should be enough to have the vulkan-validationlayers package installed and then observe Player.log for extra validation messages.

It's been a while since I worked with the actual Unity Vulkan implementation details. I had a vague memory that I couldn't do something I needed through kUnityVulkanGraphicsQueueAccess_Allow earlier, so I tried to work my way through the render thread instead when possible. :) With kUnityVulkanGraphicsQueueAccess_Allow you're getting the task executor command buffer at least.

BrianHarrisonUnity commented 1 year ago

I have been testing with the Vulkan validation layers on, and they have been working for me. I just installed the Vulkan SDK and then ran with -force-vulkan-layers. It works with both the editor and standalone builds in my case.

I made a different testing branch, experimental/vulkan-queue-access-submit, that resolves the validation errors I was seeing and should, I think, have the same performance as before.

aet commented 1 year ago

@BrianHarrisonUnity this is on Linux? I'm using Ubuntu 22.04

BrianHarrisonUnity commented 1 year ago

Yes, this is on Linux, but I am using Ubuntu 20.04.

aet commented 1 year ago

Found a workaround: removing the -static-libstdc++ rules for the Linux target. Otherwise I get the following call stack:

    Obtained 23 stack frames.
    0  0x007ff78dde8520 in __sigaction
    1  0x007ff78de3ca7c in pthread_kill
    2  0x007ff78dde8476 in raise
    3  0x007ff78ddce7f3 in abort
    4  0x007ff78de2f6f6 in __fsetlocking
    5  0x007ff78de46d7c in timer_settime
    6  0x007ff78de48ac4 in __default_morecore
    7  0x007ff78de4b4d3 in free
    8  0x007ff7832ba07d in std::locale::_Impl::~_Impl()
    9  0x007ff780296ae6 in std::locale::~locale()
    10 0x007ff6a1ca7d6d in vkGetDeviceProcAddr
    11 0x007ff6a15bd586 in (Unknown)
    12 0x007ff6a1bddffb in vkGetDeviceProcAddr
    13 0x007ff6a18b6520 in vkGetDeviceProcAddr
    14 0x007ff6a175baa0 in vkGetInstanceProcAddr
    15 0x007ff78effc67e in vk::RenderPassSwitcher::ClearCurrentFramebuffer(vk::CommandBuffer, unsigned int, ColorRGBAf const*, int, unsigned int, float, unsigned int)
    16 0x007ff78effc170 in GfxDeviceVK::ClearImpl(GfxClearFlags, ColorRGBAf const*, int, unsigned int, float, unsigned int)
    17 0x007ff78f6c715c in GfxDeviceWorker::RunCommand(ThreadedStreamBuffer&)
    18 0x007ff78f6d049f in GfxDeviceWorker::RunExt(ThreadedStreamBuffer&)
    19 0x007ff78f6c6dc5 in GfxDeviceWorker::RunGfxDeviceWorker(void*)
    20 0x007ff78edee829 in Thread::RunThreadWrapper(void*)
    21 0x007ff78de3ab43 in pthread_condattr_setpshared
    22 0x007ff78decca00 in __xmknodat

aet commented 1 year ago

Ok, made some progress today. I saw some validation issues I've seen in the past, particularly with image layout transitions. I got them fixed by porting my old changes, but testing is still ongoing in general and for performance. It's tempting to start side-tracking and port larger IGraphicsDevice changes across the board :) so I'm trying to stay within the Vulkan domain for now with the porting.

aet commented 1 year ago

@karasusan I don't know if you're in a hurry to move forward with the Vulkan stuff soon, but it looks like I'm currently digging myself deeper into some Vulkan and generic-level optimizations. The bad part is that my approach to CommandRecordingState usage works better if the whole frame pooling is also renewed. The new alternative is working pretty well, but it's equally possible it might not turn out to be a good generic solution versus my cloud performance requirements... we'll see.

As of today, I'm seeing gigabytes of leaks in a relatively short time with the main branch when running on cloud. The new pooling structure has them mostly fixed, but I'm now debugging the performance difference between a local RTX 2060 and a Tesla T4 on cloud. The local performance is pretty OK, but I'm not happy with the T4 behavior.

BrianHarrisonUnity commented 1 year ago

@aet It might be worthwhile trying the latest changes in #889. That branch resolves most of the Vulkan validation errors I was seeing, and it performs well for me on Linux and with T4s in the cloud.

aet commented 1 year ago

@BrianHarrisonUnity Will do. Have you been observing the framebuffer memory usage from the nvidia stats on cloud? An app that doesn't leak locally currently keeps leaking for me on cloud, easily going from a 1.5 GB starting point up to 15 gigabytes in a matter of 3 hours. The regression seems to be somewhere within webrtc; it doesn't happen when using a modified branch based off the M92 code from a year ago.

BrianHarrisonUnity commented 1 year ago

@aet Wow, that is a huge leak. I haven't been checking the GPU memory usage on the cloud T4, but I can check tomorrow.

Just to confirm, is this 3 hours of continuous streaming between a web client and the cloud server? Also, what codec, resolution, bitrate, and FPS are you configured for?

aet commented 1 year ago

It's a cloud server broadcasting a single connection to Janus, which handles the browsers. Targeting 1080p60 streaming at 8 Mbit/s or more, using H264/NVENC.

BrianHarrisonUnity commented 1 year ago

@aet I ran a test for an hour with 1920x1080, 8+ Mbit/s, and H264 on a T4 in the cloud, and my GPU memory went from 700 MiB to 1410 MiB.

I saw a few disconnects due to connection instability, and I could see the memory jump after each one, but it stayed pretty consistent otherwise. Seems like there might be a slow leak in my case, but not as severe as what you were seeing. For me it looks like some memory isn't freed after disconnections, but otherwise it's consistent.

Edit: Tried rapidly reconnecting, and the memory creeps up on each reconnect. This is the total machine GPU memory from nvidia-smi, though. I'll have to check again on my local machine and get the process-specific GPU memory to see if this is accurate.

aet commented 1 year ago

@BrianHarrisonUnity No worries. I didn't have time to test https://github.com/Unity-Technologies/com.unity.webrtc/pull/889 yet, but I see it got merged into main. It's fine though :)

I'm used to my use case going through some optimizations that might not be suitable for general purpose: aiming to reach high GPU utilization and avoid GPU stalls, at the potential extra cost of some latency and memory usage. I'll try to contribute my changes when I can to reduce maintenance overhead. To reduce cost, it aims to run with a pretty small CPU count, which adds some extra woes.

I'll check back on the Vulkan specifics and pooling/frame-scheduling improvements once I have gotten more testing mileage!

karasusan commented 1 year ago

@aet @BrianHarrisonUnity FYI: is this related to your memory issue? We found this issue last year but it hasn't been solved. https://github.com/Unity-Technologies/com.unity.webrtc/issues/765

aet commented 1 year ago

@karasusan It shouldn't be related to my leak; the connections are between two peers only and remain the same across the lifetime of the app.

BrianHarrisonUnity commented 1 year ago

@karasusan Oh, that is 100% what I was seeing. Thanks for the heads up! Strange issue: it looks like the GPU memory goes up, but the GPU memory assigned to the process doesn't.