NVIDIA / VideoProcessingFramework

Set of Python bindings to C++ libraries which provides full HW acceleration for video decoding, encoding and GPU-accelerated color space and pixel format conversions
Apache License 2.0

Frames and Timestamps(PTS) matching #492

Closed DanCorvesor closed 1 year ago

DanCorvesor commented 1 year ago

Describe the bug
There seems to be no obvious way (or example of how) to match decoded frames with their timestamps, specifically when the decoding queue has to be flushed.

Specifically, in SampleDecode (with the standalone demuxer) and SampleDemux, the packets are consumed sequentially and the examples also show how to get the timestamps (which is what I want). How can I match these timestamps with the frames, in particular when some frames are queued inside the decoder rather than returned immediately, so they have to be flushed at the end?

It would be great to have some advice/guidance on how to go about this, or to know whether it is simply not possible. When I print the timestamps in the above examples, they are not in order. I'm not a GPU expert, but it makes intuitive sense that this is because of the parallelism under the hood; again, a way to relate this information to the processed frames would be very helpful.

In addition, it would be great to know whether this is possible in the PyTorch examples: while looping through the frames, having a way to get the correct timestamps would make it possible to select only those frames that match an input fps.

Thanks in advance,

Daniel

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor, take a look here: https://github.com/NVIDIA/VideoProcessingFramework/blob/d8d5d1874c65ecfe6a82db2c282182e1b865452e/tests/test_PyNvDecoder.py#L201-L217
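
In short, that test drives the decoder with its built-in demuxer and reads each frame's PTS back through a PacketData out-parameter; buffered frames come out of the same loop, already matched to their timestamps. A minimal sketch of that pattern (the file path and gpu_id are placeholders):

    import PyNvCodec as nvc

    gpu_id = 0
    nvDec = nvc.PyNvDecoder("input.mp4", gpu_id)  # built-in demuxer

    while True:
        pdata = nvc.PacketData()
        surf = nvDec.DecodeSingleSurface(pdata)
        if surf.Empty():
            break  # decoder fully drained, nothing left to flush
        # pdata.pts is the PTS of this particular surface, even for frames
        # the decoder held back for reordering before returning them.
        print(surf.Width(), surf.Height(), pdata.pts)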

DanCorvesor commented 1 year ago

Hi @RomanArzumanyan. Thanks so much for getting back to me, it's really appreciated. This helped a lot: I was able to successfully match the timestamps against those I get reading an mp4 video on the CPU with PyAV, which is what I wanted.

However, if I use the colour conversion chain from your PyTorch example to convert into RGB format and then into a PyTorch tensor (and then back to a NumPy array), the frames I get are vastly different from those I get with PyAV on the CPU. Am I missing something here? I presume I should be able to reproduce them. I can provide a code snippet and the video I'm using if required.

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor

Attaching a screenshot and color conversion code would be helpful.

VPF has pretty accurate YUV > RGB color conversion; we just need to make sure that the proper color space and range are used.

DanCorvesor commented 1 year ago

Hi Roman,

Sorry for the delay. I have attached the video I have been testing on (sample-5s.mp4) and some example outputs below; the code used to generate them is provided as well. The timestamps seem correct, but I have an issue with the colour conversion as mentioned above. Let me know if you need anything else to investigate.

Code:

import time
import os

# Starting from Python 3.8 DLL search policy has changed.
# We need to add path to CUDA DLLs explicitly.
import sys

if os.name == "nt":
    # Add CUDA_PATH env variable
    cuda_path = os.environ["CUDA_PATH"]
    if cuda_path:
        os.add_dll_directory(cuda_path)
    else:
        print("CUDA_PATH environment variable is not set.", file=sys.stderr)
        print("Can't set CUDA DLLs search path.", file=sys.stderr)
        exit(1)

    # Add PATH as well for minor CUDA releases
    sys_path = os.environ["PATH"]
    if sys_path:
        paths = sys_path.split(";")
        for path in paths:
            if os.path.isdir(path):
                os.add_dll_directory(path)
    else:
        print("PATH environment variable is not set.", file=sys.stderr)
        exit(1)

import av
import numpy as np
import matplotlib.pyplot as plt
import pycuda.driver as cuda
import PyNvCodec as nvc
import torch

try:
    import PytorchNvCodec as pnvc
except ImportError as err:
    raise RuntimeError(
        f"""Could not import `PytorchNvCodec`: {err}.
Please make sure it is installed! Run
`pip install git+https://github.com/NVIDIA/VideoProcessingFramework#subdirectory=src/PytorchNvCodec` or
`pip install src/PytorchNvCodec` if using a local copy of the VideoProcessingFramework repository"""
    ) from err
import logging
import warnings
from math import floor, modf

class VideoReader(object):
    """
    A helper class to read video using PyAV.
    PyAV allows reading every frame from a video along with its presentation timestamp (PTS), as described at
    https://en.wikipedia.org/wiki/Presentation_timestamp.
    This is highly useful for accurately reading videos that are recorded with a variable frame rate.

    Note: Video frames might not always be stored in chronological order.
    This class skips frames whose PTS is earlier than that of the previously returned frame.

    Parameter:
    ----------
    video_file: str
        video file path to be loaded
    fps: int, optional
        frame rate at which video frames will be returned
    debug: bool, optional
        whether to run in debug mode

    See Also
    --------
    frame_iterator: returns a generator to iterate over all frames

    """

    def __init__(self, video_file, fps: int = 1000, debug: bool = False):
        if not debug:
            av.logging.set_level(av.logging.CRITICAL)
        self.video_file = video_file
        self.debug = debug
        self.requested_fps = fps
        self._open_video_container()
        if self.debug:
            logging.info(self)

    def _open_video_container(self):
        self.return_ms = 1000.0 / self.requested_fps
        self.container = av.open(self.video_file, metadata_errors="ignore")
        self.last_returned = -1
        self.last_returned_ms = -1
        self.stream = self.container.streams.video[0]
        self.stream.thread_type = "AUTO"
        self.stream_itr = iter(self.container.decode(self.stream))
        self.stream_fps = float(self.stream.rate) if self.stream.rate else None
        self.display_aspect_ratio = self.stream.codec_context.display_aspect_ratio
        if self.display_aspect_ratio:
            self.frame_height = self.stream.codec_context.height
            self.frame_width = int(self.frame_height * self.display_aspect_ratio)
        if self.debug:
            self.duration_second = float(self.stream.duration * self.stream.time_base)
            self.num_frames = int(floor(self.duration_second) * min(self.requested_fps, self.stream_fps))
            self.num_frames += floor(modf(self.duration_second)[0] * 1000 / self.return_ms) + 1

    def seek(self, target_sec: float, any_frame: bool = True, backward: bool = True):
        try:
            target_time = int(target_sec / self.stream.time_base) + self.stream.start_time
        except TypeError:
            target_time = int(target_sec / self.stream.time_base)
        self.stream.seek(target_time, any_frame=any_frame, backward=backward)

    def reload(self):
        self._open_video_container()

    def close(self):
        self.container = None
        self.stream = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    def __str__(self):
        return f"Loaded {self.video_file}\nRecorded at {self.stream_fps} FPS read at {1000. / self.return_ms} FPS"

    def __iter__(self):
        return self.frame_iterator()

    def frame_iterator(self):
        """
        Create a generator to go through video frames

        Note
        ----
        This function skips frames with invalid pts or invalid data

        Returns
        -------
        3D Numpy array
            3D numpy array of shape (height, width, 3) in rgb order
        float
            Presentation timestamp that is computed by using video and frame metadata
        """
        while True:
            try:
                frame = next(self.stream_itr)
            except av.InvalidDataError as err:
                logging.info(err)
                continue
            except StopIteration:
                if self.debug:
                    logging.info("End of the stream")
                break
            if frame.pts is None:
                if self.debug:
                    warnings.warn("NO PTS: Frame skipped")
                continue
            time_ms = float(frame.pts * self.stream.time_base) * 1000
            if self.should_return(time_ms):
                if self.display_aspect_ratio:
                    yield frame.to_ndarray(width=self.frame_width, height=self.frame_height, format="rgb24"), time_ms
                else:
                    yield np.array(frame.to_image()), time_ms

    def should_return(self, ms):
        if ms // self.return_ms > self.last_returned:
            self.last_returned = ms // self.return_ms
            self.last_returned_ms = ms
            return True
        elif self.debug:
            logging.log(10, f"Skipping: {ms} as {self.last_returned_ms}")
        return False

class cconverter:
    """
    Colorspace conversion chain.
    """

    def __init__(self, width: int, height: int, gpu_id: int):
        self.gpu_id = gpu_id
        self.w = width
        self.h = height
        self.chain = []

    def add(self, src_fmt: nvc.PixelFormat, dst_fmt: nvc.PixelFormat) -> None:
        self.chain.append(nvc.PySurfaceConverter(self.w, self.h, src_fmt, dst_fmt, self.gpu_id))

    def run(self, src_surface: nvc.Surface) -> nvc.Surface:
        surf = src_surface
        cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG)

        for cvt in self.chain:
            surf = cvt.Execute(surf, cc)
            if surf.Empty():
                raise RuntimeError("Failed to perform color conversion")

        return surf.Clone(self.gpu_id)

class VideoReaderVPF(object):

    def __init__(self, gpu_id, enc_path):

        self.gpu = gpu_id

        self.cuda_ctx, self.cuda_str = self.initialise_cuda_ctx()

        self.nvDmx = nvc.PyFFmpegDemuxer(enc_path, {})
        self.width = self.nvDmx.Width()
        self.height = self.nvDmx.Height()
        self.nvDec = nvc.PyNvDecoder(
            self.width, self.height, self.nvDmx.Format(), self.nvDmx.Codec(), self.cuda_ctx.handle,
            self.cuda_str.handle,
        )
        self.to_rgb = self.initialise_colour_chain_converter()
        self.decoded_frames = 0

    def __iter__(self):
        return self.frame_iterator()

    def initialise_cuda_ctx(self):
        cuda.init()
        cuda_ctx = cuda.Device(self.gpu).retain_primary_context()
        cuda_ctx.push()
        cuda_str = cuda.Stream()
        cuda_ctx.pop()
        return cuda_ctx, cuda_str

    @staticmethod
    def surface_to_tensor(surface: nvc.Surface) -> torch.Tensor:
        """
        Converts planar rgb surface to cuda float tensor.
        """
        if surface.Format() != nvc.PixelFormat.RGB_PLANAR:
            raise RuntimeError("Surface shall be of RGB_PLANAR pixel format")

        surf_plane = surface.PlanePtr()
        img_tensor = pnvc.DptrToTensor(
            surf_plane.GpuMem(),
            surf_plane.Width(),
            surf_plane.Height(),
            surf_plane.Pitch(),
            surf_plane.ElemSize(),
        )
        if img_tensor is None:
            raise RuntimeError("Can not export to tensor.")

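        # RGB_PLANAR stores the R, G and B planes stacked vertically in a
        # single plane, so the plane height is 3x the image height.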
        img_tensor.resize_(3, int(surf_plane.Height() / 3), surf_plane.Width())
        img_tensor = img_tensor.type(dtype=torch.cuda.FloatTensor)
        img_tensor = torch.divide(img_tensor, 255.0)
        img_tensor = torch.clamp(img_tensor, 0.0, 1.0)

        return img_tensor

    def initialise_colour_chain_converter(self) -> cconverter:
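        # Conversion chain: NV12 (decoder output) -> YUV420 -> interleaved RGB -> planar RGB.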
        to_rgb = cconverter(self.width, self.height, self.gpu)
        to_rgb.add(nvc.PixelFormat.NV12, nvc.PixelFormat.YUV420)
        to_rgb.add(nvc.PixelFormat.YUV420, nvc.PixelFormat.RGB)
        to_rgb.add(nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR)
        return to_rgb

    def convert_pts_to_ms(self, pts):
        return float(pts * self.nvDmx.Timebase()) * 1000

    def frame_iterator(self):
        dec_frames = 0
        packet = np.ndarray(shape=(0), dtype=np.uint8)
        out_bst_size = 0
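        # Demux and decode packet by packet. The decoder may hold frames back
        # for display-order reordering, so a packet does not always yield a
        # surface immediately; out_pdata carries the PTS of the surface that
        # is actually returned.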
        while self.nvDmx.DemuxSinglePacket(packet):
            in_pdata = nvc.PacketData()
            self.nvDmx.LastPacketData(in_pdata)
            out_pdata = nvc.PacketData()

            surf = self.nvDec.DecodeSurfaceFromPacket(in_pdata, packet, out_pdata)

            if not surf.Empty():
                dec_frames += 1
                out_bst_size += out_pdata.bsl
                timestamp = self.convert_pts_to_ms(out_pdata.pts)
                # Convert to planar RGB
                rgb_pln = self.to_rgb.run(surf)
                src_tensor = self.surface_to_tensor(rgb_pln)
                self.decoded_frames += 1
                yield src_tensor, timestamp

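        # Flush the decoder: surfaces still buffered for reordering are
        # returned here, each with its own PTS filled into out_pdata.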
        while True:
            out_pdata = nvc.PacketData()
            surf = self.nvDec.FlushSingleSurface(out_pdata)
            # print(out_pdata)
            if not surf.Empty():
                out_bst_size += out_pdata.bsl
                timestamp = self.convert_pts_to_ms(out_pdata.pts)
                rgb_pln = self.to_rgb.run(surf)
                src_tensor = self.surface_to_tensor(rgb_pln)
                self.decoded_frames += 1
                yield src_tensor, timestamp

            else:
                break

def main(gpu, enc_path):
    # Access gpu for first time outside of loop for fair comparison
    torch.zeros(10, 10).cuda()

    video_reader_timestamps = []
    video_reader_frames = {}
    start_video_reader = time.time()
    video_reader = VideoReader(enc_path)
    for frame, ms in video_reader:
        video_reader_timestamps.append(round(ms, 5))
        video_reader_frames[round(ms, 5)] = torch.from_numpy(frame).cuda()
    print(f'Time taken for video reader streaming {time.time() - start_video_reader} seconds')

    vpf_timestampso = []
    vpf_frames = {}
    start_vpf = time.time()

    video_reader_vpf = VideoReaderVPF(gpu, enc_path)

    for tensor, ms in video_reader_vpf:
        round_ms = round(ms, 5)
        vpf_frames[round_ms] = tensor
        vpf_timestampso.append(round_ms)

    print(len(vpf_timestampso))

    print(f'Time taken for vpf streaming {time.time() - start_vpf} seconds')
    print(video_reader_vpf.decoded_frames)
    print(f'Number of vpf timestamps: {len(vpf_timestampso)}, '
          f'number of video reader timestamps: {len(video_reader_timestamps)}')
    print(vpf_timestampso == video_reader_timestamps)

    diff = []
    for element in vpf_timestampso:
        if element not in video_reader_timestamps:
            diff.append(element)
    print(f'Timestamps that are different between vpf and the video reader (up to 5 decimal places): {diff}')

    for vid_reader_timestamp, vid_reader_frame in video_reader_frames.items():
        f, axarr = plt.subplots(1, 3)
        # permuted_frame = vid_reader_frame.permute(2, 0, 1).cpu()
        axarr[0].imshow(vid_reader_frame.cpu())
        axarr[0].set_title('PyAV', fontstyle='italic')
        matching_vpf_frame = vpf_frames[vid_reader_timestamp].permute(1, 2, 0).cpu()
        vid_reader_frame = vid_reader_frame.cpu()
        axarr[1].imshow(matching_vpf_frame)
        axarr[1].set_title('VPF', fontstyle='italic')
        axarr[2].imshow(abs(vid_reader_frame - matching_vpf_frame))
        axarr[2].set_title('Diff', fontstyle='italic')
        print(np.histogram(abs(vid_reader_frame - matching_vpf_frame)))
        f.savefig(f'vpf/outputs/{vid_reader_timestamp}.png')
        plt.close()

if __name__ == "__main__":

    print("This sample decodes input video with both PyAV (CPU) and VPF (GPU) and compares frames and timestamps.")
    print("Usage: SampleDecode.py $gpu_id $input_file.")

    if len(sys.argv) < 3:
        print("Provide gpu ID, path to input file")
        exit(1)

    gpuID = int(sys.argv[1])
    encFilePath = sys.argv[2]

    main(gpuID, encFilePath)

https://github.com/NVIDIA/VideoProcessingFramework/assets/44499515/8af51601-5391-407a-94bd-7c4d4e09c291

(Example output timestamps: 0.0, 700.0, 3800.0, 5366.66667. An attached screenshot shows the PyAV / VPF / Diff comparison for the 700.0 frame.)

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor

You're converting YUV > RGB with hard-coded parameters:

    def run(self, src_surface: nvc.Surface) -> nvc.Surface:
        surf = src_surface
        cc = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG)

Actual colorspace and color range may be different, hence the difference between the VPF and PyAV results. You can get the color conversion params using the PyFFmpegDemuxer class, as shown here: https://github.com/NVIDIA/VideoProcessingFramework/blob/d8d5d1874c65ecfe6a82db2c282182e1b865452e/tests/test_PyFfmpegDemuxer.py#L73-L77

Also please note that sometimes color space and color range information isn't present in the video file; in that case you can only guess the actual values.
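
For reference, querying the demuxer looks like this (a minimal sketch; nvDmx is assumed to be an initialized nvc.PyFFmpegDemuxer, as in the test linked above):

    cspace, crange = nvDmx.ColorSpace(), nvDmx.ColorRange()
    # Either value may come back as ColorSpace.UNSPEC / ColorRange.UDEF
    # when the container carries no colorimetry metadata.
    print(cspace, crange)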

DanCorvesor commented 1 year ago

Hi @RomanArzumanyan

So in this case, for this test video, I'm getting ColorSpace.UNSPEC and ColorRange.UDEF.

But it says this is unsupported. What can I do in this case?

Also, relatedly: in the case where you mention you need to guess the colour space/range info, is there any sensible algorithmic way to do that?

DanCorvesor commented 1 year ago

Hi @RomanArzumanyan, an update: I actually messed up. I was comparing integer values to float values, so the difference is actually very small, which is great. Sorry for the back and forth on this.

However, going back to the point you made: given the colour space and range come back unsupported, are the ones I specified originally good defaults? (They are working in this case, so they seem to be.) Is there a way I could check in code, when I initialise the colour converter class, whether the colour space/range inferred from the demuxer is supported, and fall back to these defaults if not?

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor

Is there a way I could check in code whether a colour space is supported when I initialise the colour converter class to check if the colour space/range inferred from the demuxer is supported and use these defaults if not?

You can get the values with nvDmx.ColorSpace() and nvDmx.ColorRange(). If they return ColorSpace.UNSPEC and/or ColorRange.UDEF, then you can only guess or hard-code the values. Choosing different color conversion options won't crash your program; it will only affect the colors. Basically, that's what's happening in your PyAV vs. VPF comparison test: PyAV just chooses different default options. Since there are just 4 possible combinations (2 color space and 2 color range options), you can play around and see how similar the PyAV and VPF results are.
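
Wiring that fallback into the cconverter from the snippet above might look like this (a sketch only; the helper name pick_conversion_context is mine, and the BT_601/MPEG fallbacks are simply the values hard-coded in the original code):

    def pick_conversion_context(nvDmx) -> nvc.ColorspaceConversionContext:
        cspace = nvDmx.ColorSpace()
        crange = nvDmx.ColorRange()
        # Fall back to guessed defaults when the file carries no metadata.
        if cspace == nvc.ColorSpace.UNSPEC:
            cspace = nvc.ColorSpace.BT_601
        if crange == nvc.ColorRange.UDEF:
            crange = nvc.ColorRange.MPEG
        return nvc.ColorspaceConversionContext(cspace, crange)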

DanCorvesor commented 1 year ago

Thanks again @RomanArzumanyan last question (I hope) - you mentioned there are two default options for both. What are these, how can I see them?

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor

Honestly, I don't know what the default values for PyAV are. I assume the decision is made somewhere deep within FFmpeg's guts.

For VPF there are no default values; that's done on purpose. Inaccurate color space conversion can impose a penalty on inference accuracy. There were a couple of issues of this nature; you can find them in the list of closed issues if you like.

you mentioned there are two default options for both. What are these, how can I see them?

If you want to see possible values for color space and color range, here they are: https://github.com/NVIDIA/VideoProcessingFramework/blob/d8d5d1874c65ecfe6a82db2c282182e1b865452e/src/PyNvCodec/src/PyNvCodec.cpp#L240-L250

The 2 most common SDR color spaces are supported: BT.601 and BT.709. Those define the coefficients of the YUV > RGB color space conversion.

Also, the 2 most common SDR color ranges are supported: narrow (MPEG), where the pixel range is within [16;235], and wide (JPEG), which means a [0;255] pixel range.
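
As an illustration (my own sketch, not VPF code), mapping narrow-range 8-bit luma onto the full range is just a linear rescale:

    def mpeg_to_jpeg_luma(y: float) -> float:
        # Narrow (MPEG) 8-bit luma occupies [16; 235]; map it onto [0; 255].
        return (y - 16.0) * 255.0 / 219.0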

RomanArzumanyan commented 1 year ago

Hi @DanCorvesor. Please LMK if your issue is resolved.

DanCorvesor commented 1 year ago

Hi @RomanArzumanyan. Yes, thanks for explaining; I appreciate your help and support.