KhronosGroup / Vulkan-Docs

The Vulkan API Specification and related tools

Feedback on the video decoding extension(s) #1694

Open cyanreg opened 2 years ago

cyanreg commented 2 years ago

Hi,

I'm writing hardware decoding code for FFmpeg (code), and in the process I've found some issues with the extensions. The first one is simple: VkVideoDecodeH264ProfileEXT.pictureLayout is required to not be 0, but VkVideoDecodeH264PictureLayoutFlagBitsEXT has the value for VK_VIDEO_DECODE_H264_PICTURE_LAYOUT_PROGRESSIVE_EXT set as 0 (spec).

In the blog post you asked for feedback on sub-picture level decoding:

Currently only picture-level decode commands are supported (as specified by the appropriate codec-specific EXT extension structures for decode operations, for example VkVideoDecodeH264PictureInfoEXT). We are interested to hear of use cases that need to request more fine grained operations!

The main use-case is reduced latency. After all, at 60fps, having to wait 16 whole milliseconds while you receive a frame from a realtime stream, only then submit a queue to decode it, and then spend 16 more milliseconds while you schedule it for presentation will double the latency. At the cost of screen tearing, you can partially redraw parts of the screen, so if you do a row of slices at a time, you can cut down on a lot of latency. FFmpeg supports sub-frame decoding, and can output a row at a time. It would be nice if we could support that using Vulkan hardware decoding.

But moreover, the biggest issue with the decoding API is how you're supposed to have all slices in a single VkBuffer. Sure, you can solve that by using sparse buffers, but a lot of devices do not support such, so your only choice is to copy each slice into a large enough RAM memory allocation, and then upload that to the GPU. Some bitstreams reach bitrates of hundreds, if not thousands of megabits per second, which can take a significant amount of resources. It would be much better if you could decode a slice at a time as you're receiving the slices, which would hugely reduce allocations and copies needed. Or even if still decoding an entire picture at a time, you could let the GPU work on decoding quicker, rather than wasting a whole 16 milliseconds or so while waiting to receive each slice the frame has.

So, my recommendation would be to introduce a new vkCmdDecodeSlice command, which takes a new VkVideoDecodeSliceInfo structure and lets users feed in one slice, inside a single VkBuffer, at a time. Additionally, vkCmdStartFrame and vkCmdEndFrame commands to signal when decoding of a frame starts and ends. All of this must still take place in a single submission, of course, so calling vkCmdStartFrame without calling vkCmdEndFrame would be an error. If slice-level decoding is found to be within scope, the user can signal VkEvents inside the submission, which would inform the user whether a slice has been decoded and the contents in the VkImage corresponding to the slice are valid. Additionally, the VkVideoDecodeInfoKHR structure ought to be modified to permit having each slice in a separate VkBuffer object.
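A rough C sketch of what such an API could look like (note: vkCmdDecodeSlice, VkVideoDecodeSliceInfo, vkCmdStartFrame and vkCmdEndFrame are hypothetical names proposed in this comment, not part of any released Vulkan header, and the stand-in handle types below exist only to keep the sketch self-contained):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle types, standing in for the real Vulkan ones so the
 * sketch compiles on its own; none of this exists in vulkan.h. */
typedef struct VkBuffer_T        *VkBuffer;
typedef struct VkCommandBuffer_T *VkCommandBuffer;

/* Proposed per-slice decode info: one slice, in its own VkBuffer. */
typedef struct VkVideoDecodeSliceInfo {
    const void *pNext;           /* extension chaining, as usual in Vulkan */
    VkBuffer    srcBuffer;       /* buffer holding just this slice         */
    uint64_t    srcBufferOffset; /* start of the slice data in the buffer  */
    uint64_t    srcBufferRange;  /* size of the slice data                 */
} VkVideoDecodeSliceInfo;

/* Proposed commands: all of a frame's slices are recorded between the
 * start/end pair, within one submission. */
void vkCmdStartFrame(VkCommandBuffer cmdBuf);
void vkCmdDecodeSlice(VkCommandBuffer cmdBuf,
                      const VkVideoDecodeSliceInfo *pSliceInfo);
void vkCmdEndFrame(VkCommandBuffer cmdBuf);
```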

zlatinski commented 2 years ago

Hi @cyanreg, first of all - thank you very much for working on adding Vulkan Video support to FFmpeg! We really appreciate your contribution. We highly value your input on the Vulkan Video specification and the API.

Let me start with the first question about the latency on decoding complete frames vs. slice mode.

After all, at 60fps, having to wait for 16 whole milliseconds while you receive a frame from a real-time stream, and only then submit a queue to decode it, and then spend 16 more milliseconds while you schedule it for presentation will double the latency.

Please help me understand this. Is this a limitation with real-time streaming that it only delivers the data at a specific rate and can not deliver more data? The way the decode should work on Vulkan is by working with the parser/decoder/render/display ahead of time:

For example, the current Vulkan Video sample app that we are providing is only limited by the display rate (we do not support display timing yet), and when using a display with a refresh rate of 240 FPS, we do not have trouble maintaining 240 FPS with 4k@60 h.264/5 video content. If we were to use one of the display timing extensions (we are planning to add one soon), then the latency should be almost "0", i.e. it would only depend on the display's flip latency.

zlatinski commented 2 years ago

One more note about slice mode: decoding a single VCL slice would not allow it to be presented right away with most content. Furthermore, not all frames are presentable, and the display order is not necessarily the same as the decode order. I know there are some VR-targeting low-latency encode setups that allow for that, but if one were to follow the above setup, latency shouldn't be an issue.

But moreover, the biggest issue with the decoding API is how you're supposed to have all slices in a single VkBuffer. Sure, you can solve that by using sparse buffers, but a lot of devices do not support such, so your only choice is to copy each slice into a large enough RAM memory allocation, and then upload that to the GPU.

For this second point, why do we need the bitstream to be put in RAM? Because we do not know the size of the bitstream ahead of time? If sparse memory is not supported, the application can allocate multiple chunks of VkBuffers as a circular buffer, each with a size that can accommodate one frame's worth of bitstream data. I would guess the limitation you are seeing is related to the HW buffer alignment requirement; when data gets received from a video stream, it can be pre-parsed and cached in VkBuffers from the get-go without doing extra copies. One can also work with the data offset in the bitstream, which does not necessarily have to be HW buffer aligned (srcBufferOffset).

zlatinski commented 2 years ago

All that being said, the Khronos Video TSG has plans to add support for slice mode, but after the first Vulkan Video release. In the meantime, we still want to ensure that there aren't any major Vulkan Video API limitations on supporting low-latency video streaming applications in the absence of this feature.

cyanreg commented 2 years ago

Please help me understand this. Is this a limitation with real-time streaming that it only delivers the data at a specific rate and can not deliver more data? The way the decode should work on Vulkan is by working with the parser/decoder/render/display ahead of time:

All you described here is caching. Caching is how you ruin latency, not fix it. While the actual decoding process takes only a few milliseconds to finish, this skyrockets when you have more streams that you have to decode at the same time. NVIDIA hardware has an artificial limit of IIRC 4 streams on consumer hardware, but other vendors do not, and instead, schedule their tile decoder units around all streams' tiles that have to be decoded. What happens with picture-level decoding and multiple streams is that a single high profile high bitrate stream can bottleneck the entire decoder, preventing the other streams from being decoded in realtime. This is the whole point of Vulkan - offloading as much work off the driver as possible, only letting it do scheduling such that all API users get a fair slice, and have limited opportunity to interfere with other work scheduled by other API users.

All that being said, the Khronos Video TSG has plans to add support for slice mode, but after the first Vulkan Video release. In the meantime, we still want to ensure that there aren't any major Vulkan Video API limitations on supporting low-latency video streaming applications in the absence of this feature.

What about doing some preprocessing as slices are being received? For example, into a new opaque VkVideoDecodedSlice object, created from a vkCmdDecodeSlice command, which vkCmdDecodeVideoKHR could accept a list of in order to decode? That way the GPU is still able to do something in the dead time of receiving slices.

aabdelkh commented 2 years ago

Thanks again for this very valuable feedback @cyanreg !

As @zlatinski mentioned, the Video TSG is definitely considering low latency slice decoding. Please see the description of VkVideoEncodeH264InputModeFlagBitsEXT for some ideas in progress for encode (although that still needs updates to be consistent with the rest of the spec, and to be finalized).

We'll definitely consider how to make use of your feedback. Our main concern is not to delay the work for the first release. Once the main CTS etc. infrastructure and reference implementations are in place, making extension updates to add more features would be a lot easier.

cyanreg commented 2 years ago

Keep in mind the current decode API cannot be extended. You must have all data in a single VkBuffer. So you will need a new command-level API. I think it would be better if you spent time now making sure it's easily extendable, rather than making hacks later (like specifying each slice's data inside a separate pNext structure) or spending longer debating what a new API would look like, and how it would interact with the picture-level API.

Plus, even for picture-level decoding, requiring all your data to be inside a single VkBuffer gives the API high CPU overhead. Could you at least consider changing the API to allow for one slice per VkBuffer? Managing a pool of sparse binding/residency buffers is quite heavy without knowing upfront how large your packet is. Not everyone wants, or thinks it's acceptable, to copy megabytes on the CPU per frame.

aabdelkh commented 2 years ago

Our initial thinking was to later add a codec-specific extension structure to enable per-slice cmds (e.g. VkVideoDecodeH264SliceInfoEXT chained to VkVideoDecodeInfoKHR), or to simply expose a way to allow VkVideoDecodeH264PictureInfoEXT.sliceCount to be 1 for a multi-slice picture if per-slice cmds are supported.

Yes we'll think about this more, and see what should be updated for the first release (even if it's only picture-level) vs. later based on all your valuable feedback. Please keep it coming! :)

zlatinski commented 2 years ago

NVIDIA hardware has an artificial limit of IIRC 4 streams on consumer hardware, but other vendors do not, and instead, schedule their tile decoder units around all streams' tiles that have to be decoded.

@cyanreg, NVIDIA's Vulkan Video driver should not have a limitation on the number of decoder instances (Video Queues and Video Sessions) created. If this is what you are experiencing, it is most likely an unintentional driver bug. Can you please confirm the version of the NVIDIA driver you are using, where you see the limitation with the maximum of 4 decoder streams, and also double-check with the latest NVIDIA BETA driver?

zlatinski commented 2 years ago

What happens with picture-level decoding and multiple streams is that a single high profile high bitrate stream can bottleneck the entire decoder, preventing the other streams from being decoded in realtime. This is the whole point of Vulkan - offloading as much work off the driver as possible, only letting it do scheduling such that all API users get a fair slice, and have limited opportunity to interfere with other work scheduled by other API users.

Please try this with multiple decoder HW instances (video decode queue family instances), not multiple Video Sessions against the same queue family. I'd be surprised if you can still see what you are describing above even with high-profile 4k content.

Keep in mind the current decode API cannot be extended. You must have all data in a single VkBuffer.

For the slice decode Vulkan API, it should look very similar to the vkCmdDecodeVideoKHR(), but will allow for processing of one or more slices, instead of the entire frame. In this case, the VkBuffer will only need to have one or more slices worth of data. This would imply a smaller buffer size, right?

Could you at least consider changing the API to allow for one-slice-per-VkBuffer?

Not all HW codecs can support that.

Plus, even for picture-level decoding, requiring to have all your data inside a single VkBuffer makes the API high CPU overhead.

Regardless of whether the data is being copied slice-by-slice or for the entire frame, the entire data must be copied. The only difference is that for the per-slice copy, the pieces are distributed in time, while for the frame it is done as a single burst.

Managing a pool of sparse binding/residency buffers is quite heavy without knowing upfront how large your packet is.

I do not see how managing sparse buffers is complicated. Where is the complexity coming from?

Even when using a regular VkBuffer, please consider the following. Do not think of those buffers as individual pieces but rather as one big circular buffer. Instead of copying the stream data to a CPU-allocated buffer, try copying the data directly into this circular buffer. Then, when submitting the buffer to the decoder, the offset should reflect the beginning of the VCL frame data in this circular buffer. The write position should be reset to the beginning of the circular buffer when the space left at the end is not sufficient to accommodate the worst-case size of one frame's bitstream data.
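As a minimal sketch of that circular-buffer scheme in plain C (the struct and function names are made up for illustration, and the fencing responsibility noted in the comment is an assumption about how a real app would synchronize, not code from any Vulkan sample):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *base;       /* persistently mapped VkBuffer memory          */
    size_t   size;       /* total size of the circular buffer            */
    size_t   head;       /* next write position                          */
    size_t   worst_case; /* worst-case bitstream size of one frame       */
} BitstreamRing;

/* Copy one frame's bitstream into the ring and return its offset, to be
 * used as the decode srcBufferOffset. Wraps to the start when the space
 * left at the end cannot hold a worst-case frame. The caller must make
 * sure (e.g. with fences on frames in flight) that the decoder is done
 * with any region about to be overwritten. */
static size_t ring_write(BitstreamRing *r, const uint8_t *data, size_t len)
{
    if (r->size - r->head < r->worst_case)
        r->head = 0;                      /* wrap to the beginning       */
    size_t offset = r->head;
    memcpy(r->base + offset, data, len);  /* straight into mapped memory */
    r->head += len;
    return offset;
}
```

A real implementation would additionally round `head` up to the spec's bitstream buffer offset alignment before each write.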

cyanreg commented 2 years ago

Keep in mind the current decode API cannot be extended. You must have all data in a single VkBuffer.

For the slice decode Vulkan API, it should look very similar to the vkCmdDecodeVideoKHR(), but will allow for processing of one or more slices, instead of the entire frame. In this case, the VkBuffer will only need to have one or more slices worth of data. This would imply a smaller buffer size, right?

If the new API supports multiple slices in a single VkBuffer, then why would the current API need to exist and still be maintained? I think a new API shouldn't outright replace the old one, which still isn't finalized, but rather complement it.

Could you at least consider changing the API to allow for one-slice-per-VkBuffer?

Not all HW codecs can support that.

The driver could internally represent multiple VkBuffers as a sparse buffer. The spec doesn't say anything about sparse buffers not being supported for decoding. And if the device doesn't support that, then accordingly, the spec should be updated to explicitly mention sparse buffers are not supported for video decoding.

Plus, even for picture-level decoding, requiring to have all your data inside a single VkBuffer makes the API high CPU overhead.

Regardless of whether the data is being copied slice-by-slice or for the entire frame, the entire data must be copied. The only difference is that for the per-slice copy, the pieces are distributed in time, while for the frame it is done as a single burst.

However, if you had multiple VkBuffers, you could upload slices as you process them, rather than letting the GPU do nothing until it's got data. And better yet, you could host-map them, regardless of whether they appear sequentially in memory (since the slice data offset can not be negative, IIUC), which would also remove the alignment limitation, as you could simply adjust your data pointer downwards and then use the offset field to indicate where the slice data starts.
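The pointer adjustment amounts to rounding the slice's host address down to the required (power-of-two) alignment and compensating with the offset field; a small arithmetic sketch (the names here are made up for illustration):

```c
#include <stdint.h>

/* Where a hypothetical buffer binding would start, plus the offset of the
 * actual slice data within it (what you would pass as srcBufferOffset),
 * so no copy is needed just to satisfy the HW alignment requirement. */
typedef struct {
    uintptr_t aligned_base; /* rounded-down base address             */
    uint64_t  offset;       /* slice data start relative to the base */
} AlignedView;

static AlignedView align_view(uintptr_t data_addr, uintptr_t alignment)
{
    AlignedView v;
    v.aligned_base = data_addr & ~(alignment - 1); /* power-of-two round-down */
    v.offset       = (uint64_t)(data_addr - v.aligned_base);
    return v;
}
```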

Managing a pool of sparse binding/residency buffers is quite heavy without knowing upfront how large your packet is.

I do not see how managing sparse buffers is complicated. Where is the complexity coming from?

The complexity comes from the fact that you do not know the size of each slice at the start of decoding. Hence, the VkBuffer would need to be sparsely bound and sparsely resident (which limits device support even more). You would also be submitting one bind operation per slice in order to bind the VkDeviceMemory to the buffer, which isn't a low-overhead operation, and would require you to wait on the previous bind operation to finish before you can start a new one. If you could support one slice per VkBuffer, then you could approach this in multiple ways: either via host mapping, or via explicit uploading to a DEVICE_LOCAL-only buffer per slice, or by copying via the CPU and flushing before you begin decoding.

Even when using a regular VkBuffer, please consider the following. Do not think of those buffers as individual pieces but rather as a big circular buffer. Instead of copying the stream data to a CPU-allocated buffer, try copying the data directly over this circular buffer. Then when submitting the buffer to the decoder, the offset should reflect the beginning of the VCL frame data in this circular buffer. The circular buffer should be reset to the beginning after the data left to the end is not sufficient to accommodate the worst-case scenario size of bitstream data.

The issue with this approach is that packet sizes can wildly change during a stream, and since you cannot resize VkBuffers, you'll often need to tear it down and allocate a new one, or waste memory.

airlied commented 2 years ago

@ceyusa maybe gstreamer needs to be involved also.

zlatinski commented 2 years ago

If the new API supports multiple slices in a single VkBuffer, then why would the current API need to exist and still be maintained? I think a new API shouldn't outright replace the old one, which still isn't finalized, but rather to complement it.

Since not all HW codecs will be able to support slice mode, this feature will be added later to the spec as an extension and/or codec capability. In order to add this to the Vulkan Video specification, however, we need to have CTS tests written, with at least two implementations passing them. Our first release does not have plans for slice mode. Slice mode will be added in the form of an extension structure to the current decode/encode command, or as an entirely separate video command.

zlatinski commented 2 years ago

The driver could internally represent multiple VkBuffers as a sparse buffer.

This goes against the Vulkan API philosophy, which states that the implementation should not perform any implied work behind the scenes. Please keep in mind that most Vulkan Video APIs are only a thin wrapper on top of the codec HW.

The spec doesn't say anything about sparse buffers not being supported for decoding. And if the device doesn't support that, then accordingly, the spec should be updated to explicitly mention sparse buffers are not supported for video decoding.

Please check https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#features-sparseResidencyBuffer and https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#sparsememory-sparse-memory-aliasing from the general Vulkan spec to determine if an implementation has support for sparse buffers. If the core Vulkan implementation supports this feature, so does Vulkan Video.

You would also be submitting one vkcmd per slice in order to bind the vkdevicememory to the buffer, which isn't a low-overhead operation, and would require you to wait on the previous bind operation to finish before you can start a new one.

The application would only be required to wait on the CPU side if it uses a fence to synchronize the memory binding. When using VkSemaphore(s), no wait would be required on the CPU side. One or more vkQueueBindSparse commands can be submitted to perform the memory binding operations, after which the decode command queue submission can be made to wait on the semaphore(s) submitted from the sparse queue.

However, I'm still not clear why we need all this complexity. Are we talking about devices with very limited memory resources? For h.264/5 compressed content, one frame's worth of bitstream data should not exceed 4 MB for 4K resolution and 8 MB for 8K resolution video. If the application were to require support for 4 frames in flight (low latency caching), we are talking about total buffer requirements of 16 MB for 4K and 32 MB for 8K video content. I do not see any issue allocating 32 MB per stream on any contemporary device. The DPB and output images should be a much bigger concern with such content resolutions when it comes to memory requirements.

cyanreg commented 2 years ago

However, I'm still not clear why we need all this complexity. Are we talking about devices with very limited memory resources? For h.264/5 compressed content, one frame's worth of bitstream data should not exceed 4 MB for 4K resolution and 8 MB for 8K resolution video. If the application were to require support for 4 frames in flight (low latency caching), we are talking about total buffer requirements of 16 MB for 4K and 32 MB for 8K video content. I do not see any issue allocating 32 MB per stream on any contemporary device. The DPB and output images should be a much bigger concern with such content resolutions when it comes to memory requirements.

My performance concerns are for the transcoding use-case, where nothing is realtime. The cost of repeated memory copies of small chunks to a 16 MiB buffer adds up, and this is a purely CPU-bound operation to memory that's usually not as fast as system RAM, whose bandwidth is shared between many different devices in a system. I also have some concerns about performance on ARM systems, where on low-power devices (phones, SoCs, set-top boxes) memcpy() is infamously slow due to a severe lack of RAM bandwidth.

Slice mode will be added in the form of an extension structure to the current decode/encode command or as an entirely separate video command.

Glad to hear there's an agreement; the way you phrased it earlier made me think there would be a whole new API. Could you improve the current structure to prepare for that, then? Having the first slice in the base Info structure, with all other slices in a pNext structure, isn't terribly elegant. And being able to have each slice in its own VkBuffer means you can avoid some of the copy overhead, or eliminate it completely if host-mapping. Plus, H.264 has an intra-refresh mode where only a selected row of slices can be sent and refreshed, instead of full keyframes. It's commonly used for streaming over fully unreliable connections, such as over raw IEEE 802.11 for drones. Currently it wouldn't be possible to do this without 2 memory copies and a GPU upload, on a battery-powered device with severe bandwidth limits, if decoding is possible at all.

ceyusa commented 2 years ago

Hi,

Thanks for this discussion :)

Regarding VkVideoDecodeH264ProfileEXT.pictureLayout, I don't have much to say since it was added recently and I have to catch up with the WIP h264 decoder.

Regarding passing slices to the driver rather than full pictures, it makes sense to me for performance and latency. For example, in my current code I had to allocate and memcpy each slice until receiving the whole picture, which indeed is far from optimal. It would be much better to have a way to send slices to the driver (avoiding allocations), and even better to avoid the memcpy (but that's difficult).

Orthogonal to this, GStreamer's decoder base class supports decoding slices (aka subframes) by enabling the reception of subframes from upstream and decoding them; but the current codec-specific subclasses (which parse each frame to pass it to stateless decoders) don't support it right now (it would be nice to have, though).

zlatinski commented 2 years ago

Hi @cyanreg,

I wanted to try/evaluate the FFMPEG Vulkan-based decoder, but I wasn't able to compile the code. It looks like it is incomplete: https://github.com/cyanreg/FFmpeg/commit/b2347c36dd909d1bb60b73a318828cb0fb17880a#r68090382

Can you please advise on the plan for making this code functional? In addition, is there a plan for adding HEVC (h.265) decode support?

Thanks!

cyanreg commented 2 years ago

@zlatinski I'd finish it if any usable drivers existed. Mesa's implementation for RADV hasn't been merged yet and is months out of date, which for a WIP driver like RADV I'd consider stale, because my AMD machine is my main machine. Though literally minutes ago Dave Airlie just pushed some new related commits to a new branch, which may mean in the next few days stuff may be up to date enough to work on. Nvidia still only ships the feature in their beta drivers. But that's not the issue, the issue is that they're not packaged by any Linux distribution, and installing Nvidia drivers on Linux directly from the upstream is difficult.

Drivers aside, supporting h265 decoding is trivial. We've abstracted all hardware decoding callbacks for wmv/jpeg/vp8/vp9/h264/h265/av1, so supporting any of those is usually as easy as adding 2 lines.

airlied commented 2 years ago

You can sideload radv without installing it for the system. Build my branch with --prefix -Dvulkan-beta=enable --vulkan-drivers=amd --gallium-drivers=

Then ninja, ninja install, then point VK_ICD_FILENAMES at /share/vulkan/icd.d/radeon_icd.x86_64.json

RADV_VIDEO_DECODE=1 vulkaninfo should show the extensions.

There is still some work to get this upstream, but I'm trying to get through it.

zlatinski commented 2 years ago

Thanks, for the update, @cyanreg!

The Video TSG's goal was to replace the NV parser with an open-source one. I was under the impression that FFmpeg was further along when it comes to implementing support for h.264/5 Vulkan video decode. It seems like, at this point, GStreamer also only supports h.264 video decode.

Please let me know if we (NVIDIA) can help in any way - providing HW and/or support installing the Linux drivers, etc. Please feel free to contact me with any support requests at tzlatinski@nvidia.com.

zlatinski commented 2 years ago

You can sideload radv without installing it for the system. Build my branch with --prefix -Dvulkan-beta=enable --vulkan-drivers=amd --gallium-drivers=

https://www.phoronix.com/scan.php?page=news_item&px=RADV-Vulkan-Video-Early

Hi @airlied, Another Vulkan Video implementation is becoming a reality, which I was not aware of. That's fantastic to see.

Thank you for your efforts!

airlied commented 2 years ago

I also have the basics of an intel implementation done for h264 decode using their anv driver.

zlatinski commented 2 years ago

I also have the basics of an intel implementation done for h264 decode using their anv driver.

Great! So, we should have 3 working implementations passing Vulkan Video CTS for the official release of the spec. :).

cyanreg commented 2 years ago

Great! So, we should have 3 working implementations passing Vulkan Video CTS for the official release of the spec. :).

I think you should focus on fixing the spec first.

The first one is simple, VkVideoDecodeH264ProfileEXT.pictureLayout is required to not be 0, but VkVideoDecodeH264PictureLayoutFlagBitsEXT has the value for VK_VIDEO_DECODE_H264_PICTURE_LAYOUT_PROGRESSIVE_EXT set as 0 (spec).

That's still unchanged, and validation layers have implemented this behaviour as-written. Also other vendors are still possibly unhappy with the current frame-orientated API, and my objections about it still stand.

I'm working on finishing the code and it should be done in the next few days. Once that's done, I'll experiment with using a sparse buffer to hold all slices, but if that doesn't work, I think it'll be a real blocker to the API as-is.

zlatinski commented 2 years ago

The first one is simple, VkVideoDecodeH264ProfileEXT.pictureLayout is required to not be 0, but VkVideoDecodeH264PictureLayoutFlagBitsEXT has the value for VK_VIDEO_DECODE_H264_PICTURE_LAYOUT_PROGRESSIVE_EXT set as 0

This has already been fixed and merged to main. If this hasn't been pushed to the latest spec, yet, it will be with the next release push.

Also other vendors are still possibly unhappy with the current frame-orientated API, and my objections about it still stand.

Slice mode will be available after the first release. The issue is not only the availability of implementations and parser supporting slice mode but also the CTS tests would need to be developed and tested which would cause the spec release schedule to slip, again. If you guys want to add slice-based commands to your implementation, the Video TSG is definitely going to consider that API update. It would also ensure compatibility with the current spec and accelerate the Vulkan Video post-release version spec. update for this feature. Please let us know if you have any specific API proposals and/or prototypes on the slice mode.

I'm working on finishing the code and it should be done in the next few days.

Great! Thank you!

cyanreg commented 2 years ago

I've got a few suggestions: How is the concept of queues translated into video decoding? Could a device have multiple queues with the VK_QUEUE_VIDEO_DECODE_BIT_KHR bit set? If so, would it matter to which queue a decode operation is submitted? I know hardware decoders are usually stateless, which would make queue migration possible, so it would be a good idea to clarify this in the spec, such that vendors could possibly produce multi-queue implementations.

Also, could you state in the spec how interlaced content ought to be decoded into a single frame, with the fields interleaved at odd and even lines? As far as I can see, to do this you must set VK_VIDEO_DECODE_H264_PICTURE_LAYOUT_INTERLACED_INTERLEAVED_LINES_BIT_EXT in VkVideoDecodeH264ProfileEXT.pictureLayout, and VkVideoPictureResourceKHR.codedOffset must have a Y offset of 1 for the bottom field, but is that all? Wouldn't the stride need to be doubled so that decoding the second field onto the same image does not overwrite the first field, or is that already done by setting VK_VIDEO_DECODE_H264_PICTURE_LAYOUT_INTERLACED_INTERLEAVED_LINES_BIT_EXT? The reference code doesn't support interlaced decoding, but I really want to implement interlaced decoding properly.

zlatinski commented 2 years ago

How is the concept of queues translated into video decoding? Could a device have multiple queues with the VK_QUEUE_VIDEO_DECODE_BIT_KHR bit set?

Yes, and it is reported by the presence of VK_QUEUE_VIDEO_DECODE_BIT_KHR in VkQueueFamilyProperties::queueFlags when querying the queue properties with vkGetPhysicalDeviceQueueFamilyProperties2(). See https://github.com/nvpro-samples/vk_video_samples/blob/main/vk_video_decoder/libs/VkShell/Shell.cpp#L335 for an example. VkQueueFamilyProperties::queueCount will also report the number of queues one can have per device. This number should correspond to the number of (in this case, decode) HW instances.

If so, would it matter to which queue a decode operation is submitted? I know hardware decoders are usually stateless, which would make queue migration possible, so it would be a good idea to clarify this in the spec, such that vendors could possibly produce multi-queue implementations.

It shouldn't matter which queue the API submits to, as long as the queue's capabilities fulfill the user's expectations. Keep in mind that some implementations may support different codec types on different queue instances (which is not the case with NVIDIA's codecs).

If multiple decode streams are required, it is recommended that these streams be load balanced across different queues (HW instances) to maximize HW utilization.

All of the above are standard Vulkan concepts.

zlatinski commented 2 years ago

The reference code doesn't support interlaced decoding, but I really want to implement interlaced decoding properly.

The reference decoder sample does support interlaced decoding, and it is configured to do it with interleaved lines (not separate fields). The way it works is: the decoder is invoked once for each field, where the top/bottom parameters indicate which lines are to be filled. After a complete picture frame is produced, the image is sent for presentation. This design can be changed so that each field is sent for presentation. If you have trouble running specific video content, please send it to me and I'll see if there is a problem.
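In the interleaved-lines layout, each field only touches every other row of the destination image, so a field decode effectively writes with a doubled stride. A sketch of the addressing (illustrative only, not code from the sample):

```c
#include <stddef.h>
#include <stdint.h>

/* Address of row `field_row` of one field within a full interleaved
 * frame. `parity` is 0 for the top field and 1 for the bottom field;
 * `stride` is the full frame's row pitch in bytes, so the field's
 * effective stride is 2 * stride. */
static uint8_t *field_row_addr(uint8_t *frame_base, size_t stride,
                               int parity, int field_row)
{
    return frame_base + (size_t)(2 * field_row + parity) * stride;
}
```

With this addressing, a bottom-field-first stream would simply decode its first field with parity 1.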

cyanreg commented 2 years ago

I was referring to this line and FIXME note next to it, though. https://github.com/nvpro-samples/vk_video_samples/blob/74671960b91f0161c9fb2edcf221b976f75e647e/vk_video_decoder/libs/VulkanVideoFrameBuffer/VulkanVideoFrameBuffer.cpp#L555

Is that a leftover comment that doesn't apply anymore? But then how would the code know which field to draw at which parity? After all, some codecs permit the bottom field to be present first.

zlatinski commented 2 years ago

https://github.com/nvpro-samples/vk_video_samples/blob/74671960b91f0161c9fb2edcf221b976f75e647e/vk_video_decoder/libs/VulkanVideoFrameBuffer/VulkanVideoFrameBuffer.cpp#L555 Is that a leftover comment that doesn't apply anymore but then how would the code know which field to draw at which parity? After all, some codecs permit bottom field to be present first.

This still needs fixing; it is currently ignored by the implementation. The parameter is more important when separate top/bottom fields are requested, as it is then used to define the offset of the second field within the image.

cyanreg commented 2 years ago

This parameter is more important when there is a request for separate top/bottom field used for defining the offset in the image of the second field.

You mean when the 2 fields are interleaved onto the same image, right?
What happens currently when decoding first the top field, then the bottom field onto the same image, with the bug still present in the implementation?

Row 0 pixels - Bottom field
Row 1 pixels - Empty
Row 2 pixels - Bottom field
Row 3 pixels - Empty

Or

Row 0 pixels - Bottom field
Row 1 pixels - Bottom field
Row 2 pixels - Bottom field
Row N/2 pixels - Bottom field
Row N/2 + 1 pixels - Empty
Row N/2 + N/2 pixels - Empty

I'm trying to figure out how the implementation knows to double the stride for interlaced images when decoding onto the same image, and I'm also trying to understand how it knows to draw which field at which parity.

For us, we only ever want to decode the fields into the same image.

zlatinski commented 2 years ago

Currently, the implementation only supports line-interleaved mode (we call this NV12), with top/bottom lines decoded based on the following picture parameters:

StdVideoDecodeH264PictureInfoFlags::field_pic_flag, StdVideoDecodeH264PictureInfoFlags::bottom_field_flag, StdVideoDecodeH264PictureInfoFlags::complementary_field_pair, StdVideoDecodeH264ReferenceInfo::PicOrderCnt.

So the top and bottom lines are output based on the StdVideoDecodeH264PictureInfoFlags::bottom_field_flag flag.

NVIDIA also supports NV24 with top and bottom fields, where the image (picture) is either split into two halves (this is where https://github.com/nvpro-samples/vk_video_samples/blob/74671960b91f0161c9fb2edcf221b976f75e647e/vk_video_decoder/libs/VulkanVideoFrameBuffer/VulkanVideoFrameBuffer.cpp#L555 comes into play) or there are separate field (half-height) images. This is very easy to implement on the decoder's side, but we haven't done it yet. It is much easier for the Vulkan presentation layer to work with one progressive image made of two line-interleaved fields than to combine two separate fields into a single image; the latter would require two sampled images and manual YCbCr to RGBA conversion, as well as line compositing on the graphics presentation side, or display HW capable of doing that.

If this is really important for FFMPEG, I can implement that feature (separate top/bottom fields) as part of the next driver BETA release in 3 weeks or so.

cyanreg commented 2 years ago

If this is really important for FFMPEG, I can implement that feature (separate top/bottom fields) as part of the next driver BETA release in 3 weeks or so.

No, it's unneeded for us; we don't support such a mode in ffmpeg at all, and don't want to anyway. It's just that the TODO comment really threw me off. I thought you had made a mistake when you said it's important for separate field mode, because I can't see how it would be important there.

I really think you ought to say in the docs that if StdVideoDecodeH264PictureInfoFlags.field_pic_flag is set, the image will be drawn with twice the stride, offset by one row if StdVideoDecodeH264PictureInfoFlags.bottom_field_flag is set. Otherwise it's not clear that users don't have to do anything on their end other than provide the same setup resource twice, and anyone reading the code in the samples will be as confused as I was.

Anyway, the API seems to have been updated, and I think there's an overlap between VkVideoCapabilityFlagBitsKHR's VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR and VkVideoDecodeCapabilityFlagsKHR's VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR. They both refer to the same capability, and both are present when checking the video decode capabilities.

Why does VK_VIDEO_DECODE_CAPABILITY_DEFAULT_KHR exist? The spec says "An implementation must report at least one of VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR or VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR as supported." which leaves no ambiguity or need for a VK_VIDEO_DECODE_CAPABILITY_DEFAULT_KHR value.

Also, the headers seem to have been updated, but there's a problem with them:

libavcodec/vulkan_decode.c:533:13: error: implicit declaration of function 'VK_MAKE_VIDEO_STD_VERSION' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
            VK_STD_VULKAN_VIDEO_CODEC_H264_SPEC_VERSION
            ^
/usr/local/include/vk_video/vulkan_video_codec_h264std.h:32:53: note: expanded from macro 'VK_STD_VULKAN_VIDEO_CODEC_H264_SPEC_VERSION'
#define VK_STD_VULKAN_VIDEO_CODEC_H264_SPEC_VERSION VK_STD_VULKAN_VIDEO_CODEC_H264_API_VERSION_0_9_5
                                                    ^
/usr/local/include/vk_video/vulkan_video_codec_h264std.h:25:58: note: expanded from macro 'VK_STD_VULKAN_VIDEO_CODEC_H264_API_VERSION_0_9_5'
#define VK_STD_VULKAN_VIDEO_CODEC_H264_API_VERSION_0_9_5 VK_MAKE_VIDEO_STD_VERSION(0, 9, 5) // Patch version should always be set to 0

You'll probably want to include vk_video/vulkan_video_codecs_common.h in vulkan/vulkan_beta.h.

zlatinski commented 2 years ago

Please note that both the BETA driver and the sample app may not yet support the latest version of Vulkan Video spec. We are in the process of updating both - waiting for some Video MR to get merged into the main spec. The currently supported versions of the spec of the app and the driver are at https://github.com/nvpro-samples/vk_video_samples/tree/main/vk_video_decoder/include/vk_video and https://github.com/nvpro-samples/vk_video_samples/tree/main/vk_video_decoder/include/vulkan.

cyanreg commented 2 years ago

@zlatinski I've completed the implementation, and dumped the code here - https://github.com/cyanreg/FFmpeg/tree/vulkan_decode It's still very much untested, as currently RADV still needs some additional code to make it work, but we'll make it happen. I'll test on my 1080Ti machine tomorrow.

Can you link the latest Linux beta drivers here for me to install? Also, do those drivers support the newly added VkVideoDecodeCapabilitiesKHR?

cyanreg commented 2 years ago

Some notes: StdVideoH264SequenceParameterSet.pOffsetForRefFrame should be made const.

zlatinski commented 2 years ago

@zlatinski I've completed the implementation, and dumped the code here - https://github.com/cyanreg/FFmpeg/tree/vulkan_decode It's still very much untested, as currently RADV still needs some additional code to make it work, but we'll make it happen. I'll test on my 1080Ti machine tomorrow.

Great, thank you!

Can you link the latest Linux beta drivers here for me to install?

Please download the latest BETA drivers for Vulkan from https://developer.nvidia.com/vulkan-driver. For Linux, it is Version 470.62.22 : https://developer.nvidia.com/vulkan-beta-4706222-linux

Also, do those drivers support the newly added VkVideoDecodeCapabilitiesKHR?

Yes, I believe this version should have the VkVideoDecodeCapabilitiesKHR.

zlatinski commented 2 years ago

Some notes: StdVideoH264SequenceParameterSet.pOffsetForRefFrame should be made const.

Thank you for your contribution! All of the STD pointers have been made const. The STD update will show up with the next Vulkan spec update.

cyanreg commented 2 years ago

Thanks, I'll install it tomorrow. My implementation currently decodes garbage on RADV, could be a driver issue, so it'll be interesting to see how it compares. You can test it by compiling and running ./ffmpeg_g -init_hw_device "vulkan=vk:0,debug=0" -hwaccel vulkan -hwaccel_output_format vulkan -i test.mkv -vframes 1 -vf hwdownload,format=nv12 -y test.nut.

By the way, I have a question about the queue family on which to perform layout changes to VK_IMAGE_LAYOUT_VIDEO_DECODE_DST_KHR. The validation layer says the compute/graphics queue does not support layout changes to VK_IMAGE_LAYOUT_VIDEO_DECODE_DST_KHR. But the decode queue does not support doing image barriers and layout changes entirely. According to @airlied, the queue family doesn't matter, but still, I'd like to know what's considered more correct.

zlatinski commented 2 years ago

By the way, I have a question about the queue family on which to perform layout changes to VK_IMAGE_LAYOUT_VIDEO_DECODE_DST_KHR. The validation layer says the compute/graphics queue does not support layout changes to VK_IMAGE_LAYOUT_VIDEO_DECODE_DST_KHR. But the decode queue does not support doing image barriers and layout changes entirely.

Decode and encode queues must support image barriers and layout changes, at least for transitions targeting the video layouts (decode dst/dpb, encode src/dpb). We are also having a discussion at the TSG about making transfer operations mandatory for those queues.

cyanreg commented 2 years ago

Decoding works with my branch, by the way. You're free to test it. I'm currently working on supporting VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR, which is proving to be very intrusive. RADV's plans are to only signal VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR, as apparently that's what the hardware wants, and having the driver do copies is forbidden by the spec.

I definitely think video queues should support copy operations. It would save on having a separate submission to copy reference frames when that's required.

cyanreg commented 2 years ago

I have a question regarding future expansion of the codecs supported, and of some extra features modern codecs have, namely film grain. The API is very low level to the point of requiring you to manage the references manually, but what about film grain application? Film grain, in either H.274 (as used in HEVC or H.264) or AV1 is done on the output frames. The reference frames are specified as being free of film grain. In order for this to happen, either a copy has to occur somewhere, and the film grain must be applied on the copy, or the user must apply film grain during presentation, leaving all frames untouched.

How would this be handled in Vulkan decoding? Optimally, I'd like to leave it up to users to apply film grain on their own via a new vkCmd() function that would run on the decode queue and apply film grain to arbitrary frames. This would allow users to eliminate the copy needed on their own if they can apply film grain during presentation (the AV1 specifications allows for this, and some libraries, like libplacebo can already apply both AV1 and H264/5 film grain during presentation).

zlatinski commented 2 years ago

Hi cyanreg,

Sorry for the late responses.

Decoding works with my branch, by the way. You're free to test it.

Great - thank you very much. Would you mind updating this to Vulkan API 1.3.212 (https://github.com/nvpro-samples/vk_video_samples/tree/main/common/include)? I'll then definitely evaluate this. Are your changes working with h.264 and h.265?

I'm currently working on supporting VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR, which is proving to be very intrusive. RADV's plans are to only signal VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR, as apparently that's what the hardware wants, and having the driver do copies is forbidden by the spec.

The application should not do any copies; rather, it should provide DPB images separately from the output. I.e., you need to provide separate images for the output and for the DPB, where the application does nothing with the DPB images other than manage their indexes for the decoder.

zlatinski commented 2 years ago

I have a question regarding future expansion of the codecs supported, and of some extra features modern codecs have, namely film grain. The API is very low level to the point of requiring you to manage the references manually, but what about film grain application? Film grain, in either H.274 (as used in HEVC or H.264) or AV1 is done on the output frames. The reference frames are specified as being free of film grain. In order for this to happen, either a copy has to occur somewhere, and the film grain must be applied on the copy, or the user must apply film grain during presentation, leaving all frames untouched.

How would this be handled in Vulkan decoding? Optimally, I'd like to leave it up to users to apply film grain on their own via a new vkCmd() function that would run on the decode queue and apply film grain to arbitrary frames. This would allow users to eliminate the copy needed on their own if they can apply film grain during presentation (the AV1 specifications allows for this, and some libraries, like libplacebo can already apply both AV1 and H264/5 film grain during presentation).

I agree with you, cyanreg, that this should be at least a separate command if there is a concrete way of doing it. In Vulkan, as an alternative, the application can also use Vulkan Graphics and/or Compute pipelines to do that in a post-processing stage.

Please note, however, that this subject has not yet been considered at the Video TSG. I'll open an issue for that.

cyanreg commented 2 years ago

I tried to install the beta NVIDIA drivers, but the installer failed. Looking at the error logs, the compiler can't find basic headers (stdarg.h/stddef.h), and plenty of #error wait_on_bit_lock() conftest failed! prints appear. Looking around, many others have had these issues with no clear solution. This is on a very standard, fresh Debian install, with the kernel compiled right on the very machine I'm installing for. I'd try the newly released open-source kernel modules too, but the same installer performs the compilation afaik, and the beta Vulkan drivers are 30+ versions behind the current driver anyway.

zlatinski commented 2 years ago

I tried to install the beta NVIDIA drivers, but the installer failed.

Sorry for the trouble with installing the NVIDIA Linux drivers, cyanreg. Please ensure you have the kernel sources installed. What are your Debian and Linux kernel versions? What options are you selecting during installation?

Please feel free to contact me directly at tzlatinski@nvidia.com and I'm sure we can resolve the issue with the driver installation.

zlatinski commented 1 year ago

Hi cyanreg, I'm not sure if you've tried the latest spec and NVIDIA drivers, but we're getting close to the point where the APIs are quite stable and we don't expect any ABI breaking changes. Please let me know if you have time to finish the FFMPEG support for Vulkan Video and if you would require any assistance or support from us. Thank you!

cyanreg commented 1 year ago

I'll finish it by then. Haven't had time lately, and the constant random API breaks were inconvenient. Have you made any progress on AV1 decoding with/without film-grain, or slice-based decoding and encoding?

zlatinski commented 1 year ago

Have you made any progress on AV1 decoding with/without film-grain, or slice-based decoding and encoding

Those are the things that are still in our Video TSG plans, but will not be implemented until next year.

zlatinski commented 1 year ago

I'll finish it by then. Haven't had time lately, and the constant random API breaks were inconvenient.

Great. Thank you, cyanreg. Please keep in mind that we are still expecting some more Vulkan Video Encode incompatibilities as we advance and improve those APIs.

cyanreg commented 1 year ago

I see. I found a couple of misnamed/redundant struct members in the codec-specific PPS/SPS structs last time; in case there are still typos, I'll go over them again, and hopefully there's enough time to address any that remain. I'll also be sure to test decoding from a sparse buffer, as I described last year.

In terms of priority, I'm looking forward to AV1 decoding more than slice-based decoding, as other existing APIs have synchronization issues when importing into Vulkan for rendering.