NVIDIA / VideoProcessingFramework

A set of Python bindings to C++ libraries that provide full HW acceleration for video decoding and encoding, plus GPU-accelerated color space and pixel format conversions
Apache License 2.0

How to handle `yuv420p10le`? #274

Open riaqn opened 2 years ago

riaqn commented 2 years ago

I have some videos in the yuv420p10le pixel format (as reported by ffmpeg). Should I just specify nvc.PixelFormat.NV12 when creating the decoder? If I do, the downloaded picture is off-color and has many artifacts, which suggests the decoder doesn't handle 10-bit LE very well. But when I look at the code https://github.com/NVIDIA/VideoProcessingFramework/blob/885296a1066128d36719d699f7c76f77e9c081ea/PyNvCodec/TC/src/NvDecoder.cpp#L275

it seems to handle bit depth correctly. I suppose it must then be about byte order: does the Nvidia Video Codec SDK only support BE?

rarzumanyan commented 2 years ago

Hi @riaqn

High bit depth is not currently supported by VPF.

riaqn commented 2 years ago

@rarzumanyan Thank you for your quick reply. I'm reading the Nvidia SDK documentation and can't find a byte-order field among the decoder creation parameters. Or do you mean the packets can simply be converted to BE in software before being passed to the decoder?

rarzumanyan commented 2 years ago

@riaqn

IIRC, both LE and BE should be supported.

An easy way to check would be to decode your file with ffmpeg using the NV HW-accelerated decoders (h264_cuvid or hevc_cuvid). Also, refer to the Nvdec support matrix to check whether your GPU supports 10-bit decoding at the HW level.

An encoded packet contains the information required to reconstruct the samples of a video frame (motion vectors, transform coefficients and such); it has nothing to do with endianness.
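To illustrate the point with a small sketch: the `le` suffix in `yuv420p10le` only describes how each decoded raw sample is laid out in memory, not how the compressed bitstream is stored.

```python
import struct

# A 10-bit luma value (0..1023) stored in a 16-bit word; 700 is arbitrary.
sample = 700

le_bytes = struct.pack('<H', sample)  # little-endian 16-bit word
be_bytes = struct.pack('>H', sample)  # big-endian 16-bit word

# Same value either way; only the raw-frame byte layout differs.
assert struct.unpack('<H', le_bytes)[0] == struct.unpack('>H', be_bytes)[0] == 700
```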

P. S. Nvidia developers forum would be a better place to address this question.

riaqn commented 2 years ago

@rarzumanyan Thank you for the tip. I used ffmpeg to decode:

ffmpeg -c:v hevc_cuvid  -i 'test.mp4' -pix_fmt yuv420p10le out.yuv

(I can observe high GPU usage, so it is definitely using the GPU.) Then I play the raw YUV with

ffplay -f rawvideo -pixel_format yuv420p10le -video_size 7260x3630 out.yuv

and the picture seems correct.

Therefore I guess both the GPU and the driver are 10-bit HEVC ready. Do you think adding support to VPF will be a big workload? I'm not in urgent need (only a few videos in my media collection are 10-bit HEVC), so I can wait until you have time.

rarzumanyan commented 2 years ago

@riaqn

Do you think adding support to VPF will be a big workload? I'm not in urgent need

It's not that much work, but it's not the highest priority either. Most of the effort would be modifying the existing Surface classes to support higher bit depths.

I'll take this up once I refactor the code, or when a bigger issue pops up that requires high bit depth support.

riaqn commented 2 years ago

@rarzumanyan Sounds good! Looking forward to it.

Renzzauw commented 2 years ago

@rarzumanyan Hi Roman,

I stumbled upon this exact same issue while trying to decode yuv420p10le pixel format videos, so I was wondering if this was still something you were planning to support in the future?

Thanks!

rarzumanyan commented 2 years ago

Hi @Renzzauw @riaqn

Yes, it's still something I'd like to support ;) I'm just underwater with other tasks at the moment, my apologies.

riaqn commented 2 years ago

I don't mean to hijack the thread, but my library https://github.com/riaqn/python-nvidia-codec supports that. :smile: Of course, you have to use a patched pyav (with bitstream filter, color_range and color_space support) and fiddle a bit.

Renzzauw commented 2 years ago

@rarzumanyan Thanks for the update!

@riaqn Thanks for sharing that. It sounds like a useful library, especially since it's written fully in Python. Nice work! My current video processing application depends on VPF, and I'm not in a rush to support pixel formats VPF doesn't handle, but I'll keep an eye on your library's development; it might be useful in the future!

rarzumanyan commented 2 years ago

@Renzzauw, @riaqn

Please check out the issue_274 feature branch. It adds 10- and 12-bit decoding support. I've only checked 10-bit HEVC (I don't have many HBD videos nearby). This branch introduces 2 new pixel formats:

  1. P10
  2. P12

Both are basically 16-bit NV12. Please test it and LMK if it fulfills the request; I'll then merge it to master.
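For reference, here's a minimal sketch of how a downloaded P10 luma plane might be interpreted on the CPU side. The NV12-like 16-bit layout is from the description above; treating the 10 significant bits as LSB-aligned values in the range 0..1023 is an assumption of this sketch:

```python
import numpy as np

# Simulated downloaded P10 luma plane: NV12-like layout, but each sample
# occupies a 16-bit word (assumed LSB-aligned here, values 0..1023).
h, w = 2, 4
y = np.array([[0, 256, 512, 1023]] * h, dtype=np.uint16)

# Normalize to float [0, 1] using the 10-bit range for further processing.
y_f = y.astype(np.float32) / 1023.0
```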

Renzzauw commented 2 years ago

Hi @rarzumanyan, thanks for checking this out so quickly! I'm currently in the process of testing your changes.

I'm testing it by adjusting my current video processing pipeline. I currently use the conversion chain NV12 -> YUV420 -> RGB -> RGB_PLANAR; how do I adjust this to work with P10?

I tried the following conversions, but they give me "Unsupported pixel format conversion" errors:

If I keep the chain as is, it runs without problems, but the output video looks odd (which I would expect, given that I'm feeding it a 10-bit video, not an 8-bit one).

rarzumanyan commented 2 years ago

how do I adjust this to work with P10?

By adding HBD pixel formats and color conversion support ;) The existing color conversion part of VPF only supports 8-bit color. Alternatively, some sort of 10-bit > 8-bit range mapping has to be done (which doesn't exist in VPF yet).

Renzzauw commented 2 years ago

Ahh, that makes sense, haha!

Would it be possible for you to support such conversions or range mappings?

For my use case specifically, it would be nice to be able to decode HBD videos, convert to planar RGB (so I can use the PyTorch extension & torchvision functionality), convert back to a Surface, and encode the result in 8-bit (it doesn't have to be a higher bit depth per se).

By the way, I just tried simply decoding one of my 10-bit videos (without doing anything else, such as pixel format conversions etc.), and this seems to work fine on my end.

rarzumanyan commented 2 years ago

@Renzzauw

Would it be possible for you to support such conversions or range mappings?

NPP does support this type of conversion, but there's a caveat: the scale factor is only supported for 32-bit floating-point images:

dstPixelValue = dstMinRangeValue + scaleFactor * (srcPixelValue - srcMinRangeValue)
scaleFactor = (dstMaxRangeValue - dstMinRangeValue) / (srcMaxRangeValue - srcMinRangeValue)

E.g. this is the function for conversion from 16-bit to 8-bit:

NppStatus nppiScale_16u8u_C1R_Ctx(
    const Npp16u *pSrc,
    int nSrcStep,
    Npp8u *pDst,
    int nDstStep,
    NppiSize oSizeROI,
    NppHintAlgorithm hint,
    NppStreamContext nppStreamCtx);

It will perform the same type of conversion regardless of the actual bit depth (e.g. 10 or 12 bit). I'm searching for a way to implement a proper mapping of 10- and 12-bit images to 8-bit, but so far I haven't found anything.
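A sketch of the formula above in Python illustrates the caveat: applying the full 16-bit range mapping to 10-bit samples (assumed here to be LSB-aligned in a 16-bit word) pushes everything toward black.

```python
def scale_range(src, src_min, src_max, dst_min, dst_max):
    # dstPixelValue = dstMin + scaleFactor * (srcPixelValue - srcMin)
    scale = (dst_max - dst_min) / (src_max - src_min)
    return dst_min + scale * (src - src_min)

# Full-range 16-bit white maps to 8-bit white as expected:
white16 = scale_range(65535, 0, 65535, 0, 255)  # ~255.0
# But 10-bit white (1023), LSB-aligned in a 16-bit word, lands near black:
white10 = scale_range(1023, 0, 65535, 0, 255)   # ~3.98
```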

rarzumanyan commented 2 years ago

@riaqn I like the idea, but there are a couple of important things to keep in mind:

I have no idea how to copy-paste the code without running into legal issues ;) I don't own this project and have to be extra careful regarding this matter.

Also, since it's a video processing framework, it would be nice to pay some attention to bit depth conversion, because HDR formats may use different mapping algorithms between bit depths. Nvdec and Nvenc natively support HBD and hence may be used to decode, say, HDR sequences. I would probably have to look into a more generic CUDA-accelerated conversion.

riaqn commented 2 years ago

@rarzumanyan Yes I understand your concern.

Regarding the bit depth conversion: you mean I can't simply truncate 16-bit and get 8-bit?

rarzumanyan commented 2 years ago

@riaqn

you mean I can't simply truncate 16-bit and get 8-bit?

Yes, simple clamping will cripple the images. 8-bit video has 0 as black and 255 as white; 10-bit video has 0 as black and 1023 as white (and so on for 12/14/16-bit).

Hence, if I simply convert 10-bit video to 8-bit by saturating and clamping at 255, such a video will be mostly blown out, since every value above 255 saturates to white.

A naive way to avoid that would be to divide every pixel value by 4, mapping the [0;1023] range to [0;255], but this may not preserve artistic intent (e.g. 10-bit footage may be captured with a "flat" profile and have a very "dull" histogram to preserve as many halftones as possible).

Long story short, in order to do proper range mapping one needs to take the transfer function into account.
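The naive divide-by-4 mapping mentioned above can be sketched with NumPy:

```python
import numpy as np

# 10-bit luma samples: black, mid-gray, white.
y10 = np.array([0, 512, 1023], dtype=np.uint16)

# Naive linear mapping: drop the 2 least significant bits (divide by 4).
y8 = (y10 >> 2).astype(np.uint8)
# -> [0, 128, 255]: the range is preserved, but the mapping is purely
# linear and ignores the transfer function, flattening graded halftones.
```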

riaqn commented 2 years ago

@rarzumanyan Yes, by 'truncation' I mean bit truncation, i.e. taking the most significant 8 bits of the 10, which is the 'naive way' you mentioned. Could you maybe provide some examples of why this is bad? Thank you.

rarzumanyan commented 2 years ago

@Renzzauw

With a couple of NPP tricks I've implemented the simplest p10>nv12 and p12>nv12 color conversion (cconv) using integer division and range conversion. To the naked eye the cconv looks good, so IMO it can be used until someone comes up with a better conversion algorithm.

@riaqn

Could you maybe provide some examples of why this is bad?

It's not bad, it's just the simplest, most straightforward approach. From my experience (which may not be relevant; I haven't worked extensively with HBD videos), HBD videos are used for editing and such, and users often prefer a more sophisticated mapping from HBD to 8-bit. LSB truncation is okay-ish as a way to increase the number of videos VPF can handle, but it may be sub-optimal for advanced editing.

riaqn commented 2 years ago

@Renzzauw I'm not sure if this is useful to you, but it seems that torchvision 0.12 now includes GPU decoding: https://pytorch.org/vision/stable/generated/torchvision.io.VideoReader.html#torchvision.io.VideoReader

And even if you don't use PyTorch, there is also https://github.com/dmlc/decord#bridges-for-deep-learning-frameworks

But I don't really know what the catches of using these are. Maybe @rarzumanyan will know? Also, I'm not sure if those libraries include YUV->RGB conversion.

Renzzauw commented 2 years ago

@rarzumanyan Thanks for looking into implementing these conversions! I'll check out the new changes.

@riaqn Thanks for providing these resources, these are certainly helpful and welcome!

I did some research into various Python decoding & encoding options, and at the time VPF appeared the most appealing to me, as it also provides an encoding solution; hence I probably didn't look that deeply into torchvision's VideoReader. I do wonder, though, how the two compare performance-wise.

Decord also seems nice, but judging by the issue board, it is no longer actively maintained?

Anyway, appreciate the help from both of you!

riaqn commented 2 years ago

@Renzzauw Yes, Decord seems to need a significant patch to work with ffmpeg 5, and the decoder in torchvision is not very optimized. For example, it doesn't provide an interface to grab a batch of frames into an N x 3 x H x W tensor, which would be much more efficient than getting frames one by one.

I think I will keep working on my library so that it implements demuxing without pyav.

ADongGu commented 11 months ago

@Renzzauw

With a couple of NPP tricks I've implemented the simplest p10>nv12 and p12>nv12 color conversion (cconv) using integer division and range conversion. To the naked eye the cconv looks good, so IMO it can be used until someone comes up with a better conversion algorithm.

@riaqn

Could you maybe provide some examples of why this is bad?

It's not bad, it's just the simplest, most straightforward approach. From my experience (which may not be relevant; I haven't worked extensively with HBD videos), HBD videos are used for editing and such, and users often prefer a more sophisticated mapping from HBD to 8-bit. LSB truncation is okay-ish as a way to increase the number of videos VPF can handle, but it may be sub-optimal for advanced editing.

My CUDA version is 11.3, and I get: error: 'nppiDivC_16u_C1RSfs_Ctx' was not declared in this scope. What is your version?

Also, if you go from 10 bits to 8 bits, the output will still be 8 bits. How do you handle it if you want to encode 10 bits directly?

Hope to get your reply, best wishes.

riaqn commented 11 months ago

My library is here, feel free to try: https://github.com/riaqn/python-nvidia-codec