elFarto / nvidia-vaapi-driver

A VA-API implementation using NVIDIA's NVDEC

Use image import API instead of streams? #15

Open philipl opened 2 years ago

philipl commented 2 years ago

Just thinking out loud here, given all the various stream error scenarios that have been reported: is it really worth using the streams to be able to export, as opposed to using the import API to import an EGL image and then copying the video frame into it? This is the approach the nvdec interop in mpv takes (as it's the only one that works for non-EGL APIs). Yes, you need a copy, but GPU->GPU copies are trivially cheap and the mechanism is much simpler.
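
A minimal sketch of that import->copy idea, assuming the caller already has an EGLImage (`image`) and the decoded luma plane as pitch-linear device memory (`src_luma`/`src_pitch`, e.g. from cuvidMapVideoFrame). The names and the single-plane scope are illustrative, not code from this driver or mpv:

```c
#include <cuda.h>
#include <cudaEGL.h>

static void copy_luma_into_image(EGLImageKHR image, CUdeviceptr src_luma,
                                 size_t src_pitch, size_t width, size_t height)
{
    /* Register the existing EGLImage with CUDA (error handling omitted). */
    CUgraphicsResource res;
    cuGraphicsEGLRegisterImage(&res, image, CU_GRAPHICS_REGISTER_FLAGS_WRITE_DISCARD);

    CUeglFrame frame;
    cuGraphicsResourceGetMappedEglFrame(&frame, res, 0, 0);

    /* DEVICE -> ARRAY copy into plane 0 of the imported image. */
    CUDA_MEMCPY2D cpy = {
        .srcMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcDevice     = src_luma,
        .srcPitch      = src_pitch,
        .dstMemoryType = CU_MEMORYTYPE_ARRAY,
        .dstArray      = frame.frame.pArray[0],
        .WidthInBytes  = width,
        .Height        = height,
    };
    cuMemcpy2D(&cpy);

    cuGraphicsUnregisterResource(res);
}
```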

elFarto commented 2 years ago

It might be worth revisiting, but it was never meant to be an image in the first place. I was hoping I could just use cuMemExportToShareableHandle on the chunk of memory NVDEC returned, but that didn't work. The 'solution' just kept growing as I threw code at it to make it work.

Creating a set of EGLImages (I guess one for each VASurface), copying into those and then exporting them would make more sense now, but I was really hoping to avoid any copies. But the limiting factor is that the OpenGL driver doesn't like binding a pitch-linear imported buffer to a GL_TEXTURE_2D.

It might be possible to use Vulkan for this (although that seems like it's asking for more problems). It looks like it's possible to export the memory, rather than the image, from CUDA to Vulkan, and re-export it as a DMA-BUF.
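
A rough sketch of what that path might look like, purely as a guess: it assumes the memory was allocated with cuMemCreate requesting a POSIX fd handle, and it needs VK_KHR_external_memory_fd plus VK_EXT_external_memory_dma_buf (which, as the next comments note, isn't available on NVIDIA). Whether the driver would actually allow importing one handle type and re-exporting another is an open question; error handling omitted.

```c
#include <cuda.h>
#include <vulkan/vulkan.h>

static int reexport_as_dmabuf(VkDevice device, uint32_t memTypeIndex,
                              CUmemGenericAllocationHandle handle,
                              VkDeviceSize size)
{
    /* 1. Export the CUDA allocation as an opaque POSIX fd. */
    int cudaFd = -1;
    cuMemExportToShareableHandle(&cudaFd, handle,
                                 CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);

    /* 2. Import that fd into Vulkan, marking it exportable as a dma-buf. */
    VkImportMemoryFdInfoKHR importInfo = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT,
        .fd = cudaFd,
    };
    VkExportMemoryAllocateInfo exportInfo = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO,
        .pNext = &importInfo,
        .handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
    };
    VkMemoryAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &exportInfo,
        .allocationSize = size,
        .memoryTypeIndex = memTypeIndex,
    };
    VkDeviceMemory mem;
    vkAllocateMemory(device, &allocInfo, NULL, &mem);

    /* 3. Re-export the same memory as a dma-buf fd. */
    PFN_vkGetMemoryFdKHR getMemoryFd =
        (PFN_vkGetMemoryFdKHR)vkGetDeviceProcAddr(device, "vkGetMemoryFdKHR");
    VkMemoryGetFdInfoKHR getFdInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR,
        .memory = mem,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
    };
    int dmabufFd = -1;
    getMemoryFd(device, &getFdInfo, &dmabufFd);
    return dmabufFd;
}
```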

philipl commented 2 years ago

Which Vulkan extension supports exporting a dma-buf on NVIDIA? I couldn't find one.

elFarto commented 2 years ago

VK_EXT_external_memory_dma_buf.... which I've noticed isn't available on NVIDIA yet, sigh

philipl commented 2 years ago

Yeah, that's what I discovered too.

So I do think it's worth switching to import->copy. You are making a copy anyway, because you're able to unmap after exporting, so if you import the image, then copy the frame, and then unmap, you're not making more copies. And you should be able to copy directly from the buffer to the imported frame (copy from DEVICE to ARRAY). Another bonus is that it won't require much to be added to https://github.com/FFmpeg/nv-codec-headers to use the loader (which I've started looking at).

philipl commented 2 years ago

Started looking at this more. The primary challenge is that switching to import means you need to introduce usage of a new API to create the images. You can't create EGL Images from thin air. In mpv we use GL, and we could conceptually do that here too. I investigated using gbm, which is theoretically the correct API if you just want a generic buffer of memory (ahem). But cuda can't import gbm dma-buf fds (of course) and the set of formats that nvidia supports for creating gbm bos is so limited that it doesn't even support the basic stuff we could use like R8 and RG1616, let alone NV12/P016.
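
For reference, a small probe along the lines of that investigation: ask gbm whether it can create linear buffers in the formats a pitch-linear NV12/P016 surface would need. The device node path is illustrative; gbm formats are plain DRM fourccs, so the drm_fourcc.h constants can be passed directly.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <gbm.h>
#include <drm_fourcc.h>

int main(void)
{
    /* Error handling omitted for brevity. */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    struct gbm_device *gbm = gbm_create_device(fd);

    const uint32_t formats[] = {
        DRM_FORMAT_R8, DRM_FORMAT_RG1616,
        DRM_FORMAT_NV12, DRM_FORMAT_P016,
    };
    for (size_t i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        int ok = gbm_device_is_format_supported(gbm, formats[i],
                                                GBM_BO_USE_LINEAR);
        printf("%.4s: %s\n", (const char *)&formats[i],
               ok ? "supported" : "not supported");
    }

    gbm_device_destroy(gbm);
    return 0;
}
```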

They really know how to make this hard.

elFarto commented 2 years ago

I've been thinking about this one as well. I'm not sure there's any need to change it just yet. The stream errors we've been seeing were actually just caused by some bad cleanup code, and the EGL issues were seemingly caused by some naive init code. The only EGL issue remaining seems to be it refusing to export a buffer, which I'm not sure is a streams issue.

philipl commented 2 years ago

True enough. Did you try exporting the frames as two single plane images? Maybe it can handle that whereas it can't handle a multi plane image.

philipl commented 2 years ago

@cubanismo, would also really appreciate your thoughts here. I suspect you'll say that the existing eglstreams usage is the best way to do this.

cubanismo commented 2 years ago

They really know how to make this hard.

I swear we're not doing it on purpose. :-)

There has been much demand for GBM from the community for a long time. Rather than wait until it was perfect, we shipped enough to support basic Wayland use cases when we did. Supporting more formats will take additional effort, but it's good to know there are use cases/demand from the community, as that always helps justify such work.

VK_EXT_external_memory_dma_buf

Not yet.

But the limiting factor is that the OpenGL driver doesn't like binding a pitch-linear imported buffer to a GL_TEXTURE_2D.

If you look at how the NV EGL driver reports linear format modifier support, it should indicate the linear format modifier only supports "external" texture targets. Our GPU architecture's support for rendering to pitch/linear surfaces has limitations that are difficult to report through the OpenGL APIs, so we don't advertise support for binding linear buffers to things that might end up being renderable surfaces, like GL_TEXTURE_2D textures. If you're curious, you can find more details on the actual limitations in the recently released Vulkan extension VK_NV_linear_color_attachment, which actually does expose support for rendering to linear surfaces in Vulkan.
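
A quick way to see that reporting, assuming EGL_EXT_image_dma_buf_import_modifiers is present (`dpy` and the fourcc are placeholders): query the modifiers advertised for a format and check whether DRM_FORMAT_MOD_LINEAR is flagged external-only.

```c
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm_fourcc.h>
#include <stdio.h>

static void check_linear(EGLDisplay dpy, EGLint fourcc)
{
    PFNEGLQUERYDMABUFMODIFIERSEXTPROC queryMods =
        (PFNEGLQUERYDMABUFMODIFIERSEXTPROC)
            eglGetProcAddress("eglQueryDmaBufModifiersEXT");

    EGLuint64KHR mods[64];
    EGLBoolean external_only[64];
    EGLint n = 0;
    queryMods(dpy, fourcc, 64, mods, external_only, &n);

    for (EGLint i = 0; i < n; i++) {
        if (mods[i] == DRM_FORMAT_MOD_LINEAR)
            printf("linear modifier: %s\n",
                   external_only[i] ? "external-only"
                                    : "usable as GL_TEXTURE_2D");
    }
}
```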

Getting back to the goal here, you should have better luck binding linear buffers to a GL_TEXTURE_EXTERNAL_OES texture. This may not work in OpenGL proper IIRC because of spec technicalities, but should in at least GLES 2 contexts.
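
A sketch of that GLES-side binding, assuming `image` is an EGLImage created elsewhere (e.g. via EGL_EXT_image_dma_buf_import): attach it to an external texture rather than a GL_TEXTURE_2D.

```c
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

static GLuint bind_external(EGLImageKHR image)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    /* glEGLImageTargetTexture2DOES comes from GL_OES_EGL_image. */
    PFNGLEGLIMAGETARGETTEXTURE2DOESPROC imageTargetTexture2D =
        (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
            eglGetProcAddress("glEGLImageTargetTexture2DOES");
    imageTargetTexture2D(GL_TEXTURE_EXTERNAL_OES, (GLeglImageOES)image);
    return tex;
}
```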

elFarto commented 2 years ago

Getting back to the goal here, you should have better luck binding linear buffers to a GL_TEXTURE_EXTERNAL_OES texture. This may not work in OpenGL proper IIRC because of spec technicalities, but should in at least GLES 2 contexts.

I think this gets to the root of the problem: we don't have any control over who will use the DMA-BUF, and they may be using OpenGL.

elFarto commented 2 years ago

Seems I've been able to get an alternative method for image creation working, thanks in large part to NVIDIA open sourcing their driver.

I had entertained the idea of talking directly to the NVIDIA kernel driver early on, but quickly dismissed it after a brief look at the ioctls. However, with the source now available, reverse engineering them is actually straightforward.

This lets us ask the driver to create a buffer, then export it twice: once for CUDA, and once as a DMA-BUF. We can then import it into CUDA with cuImportExternalMemory and cuExternalMemoryGetMappedMipmappedArray, all without an EGL call in sight. This also lets us side-step the nasty problem of not being able to delete buffers after sharing them via EGLStreams.
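
A sketch of what that CUDA-side import looks like, assuming the buffer is handed to CUDA as an opaque fd; the handle type, sizes and array format here are placeholders rather than the direct-backend code itself.

```c
#include <cuda.h>

static CUarray import_plane(int exported_fd, size_t alloc_size,
                            size_t width, size_t height)
{
    /* Import the fd exported by the kernel driver as external memory. */
    CUDA_EXTERNAL_MEMORY_HANDLE_DESC memDesc = {
        .type      = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD,
        .handle.fd = exported_fd,
        .size      = alloc_size,
    };
    CUexternalMemory extMem;
    cuImportExternalMemory(&extMem, &memDesc);

    /* Describe the plane layout and map it as a mipmapped array. */
    CUDA_EXTERNAL_MEMORY_MIPMAPPED_ARRAY_DESC arrDesc = {
        .offset    = 0,
        .numLevels = 1,
        .arrayDesc = {
            .Width       = width,
            .Height      = height,
            .Format      = CU_AD_FORMAT_UNSIGNED_INT8, /* luma plane of NV12 */
            .NumChannels = 1,
        },
    };
    CUmipmappedArray mipArray;
    cuExternalMemoryGetMappedMipmappedArray(&mipArray, extMem, &arrDesc);

    CUarray plane;
    cuMipmappedArrayGetLevel(&plane, mipArray, 0);
    return plane;
}
```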

In theory we could extend this method and just reverse engineer NVDEC (or more likely VDPAU) and do away with CUDA entirely, which is interesting given the large power draw increase it causes, although I wouldn't expect this to be an easy task.

I've uploaded a WIP prototype to the direct-backend branch, although be warned: since we're dealing directly with the driver, I can't guarantee compatibility on anything other than my trusty 1060. Also, I'm leaking memory all over the place, so don't run it for extended periods.

philipl commented 2 years ago

That's snazzy! I expect it to be tricky, but not impossible, to reverse engineer nvdec. The driver won't communicate that much as it's all based on command passing, but perhaps the commands are documented in the headers, and if you were to read through the code that implements v4l2 m2m on Tegra you might be able to piece it together. Regardless of all that, it's awesome that you were able to pick out how to allocate buffers directly!

cubanismo commented 2 years ago

Note we strongly discourage using the resource manager APIs directly. They are subject to change at any time and provide no backwards or forwards ABI or source-level compatibility. E.g., kernel components corresponding to some driver version 515.xx.yy could be incompatible with userspace components from driver version 515.xx.zz.

elFarto commented 2 years ago

I understand the difficulties in using those APIs, but if you guys have managed to maintain your driver against a kernel with no backwards or forwards ABI or source-level compatibility, how hard can it be? 😁

But there are limitations to staying with the CUDA/NVDEC APIs. Firstly, it's in no way designed to be shoved into a browser with its security sandbox. I'm not blaming anyone for that; that's just what it needs to do its job. But that job isn't really playing videos in a browser on a laptop.

The second is that this library isn't in full control of how many surfaces the client wants. We need to keep duplicate surfaces for every decode surface used, and we can't free them, as CUDA/EGLStreams makes that impossible. We also have to overprovision the number of decode surfaces NVDECODE allocates, as they can't be increased during playback. All this uses up VRAM, and at 8K that can be considerable: the current version of the driver takes nearly 2 GiB on an 8K video, and the experimental direct backend takes about 1.5 GiB (I have no idea why this is yet).

The third is efficiency. The current decode process is: NVDECODE writes to its internal decode surface, then NVDECODE copies that to a chunk of memory in NV12 pitch-linear format, then we copy that back to a CUarray and share it over a DMA-BUF. I'm pretty sure those two copies can be eliminated, based on how NVDECODE allocates its decode surfaces. Combined with the second point above, that's a lot of bandwidth and potentially battery power being wasted.
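
Roughly, the per-frame path being described looks like this sketch: map the NV12 pitch-linear output NVDECODE produced for a picture, then copy the luma plane into the CUarray shared over the DMA-BUF. Names are placeholders; the chroma plane and error handling are omitted.

```c
#include <nvcuvid.h>

static void copy_frame(CUvideodecoder decoder, int pic_idx,
                       CUarray luma_array, size_t width, size_t height)
{
    /* The mapped frame is NVDECODE's pitch-linear NV12 copy of the
     * internal decode surface. */
    unsigned long long devPtr = 0;
    unsigned int pitch = 0;
    CUVIDPROCPARAMS vpp = { .progressive_frame = 1 };
    cuvidMapVideoFrame(decoder, pic_idx, &devPtr, &pitch, &vpp);

    /* Our copy back into the CUarray that backs the exported DMA-BUF. */
    CUDA_MEMCPY2D cpy = {
        .srcMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcDevice     = (CUdeviceptr)devPtr,
        .srcPitch      = pitch,
        .dstMemoryType = CU_MEMORYTYPE_ARRAY,
        .dstArray      = luma_array,
        .WidthInBytes  = width,
        .Height        = height,
    };
    cuMemcpy2D(&cpy);

    cuvidUnmapVideoFrame(decoder, devPtr);
}
```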

Now I realise there's not much you can do about those issues, but I thought I should at least document the issues/constraints we're working with, and the reasons why we're pursuing alternative mechanisms.