How to properly decode multiple frames and convert them to PyTorch tensor

My config:

Video Processing Framework built from sources according to instructions
Ubuntu 22.04
Python 3.10.12
PyTorch 2.1.0.dev20230706+cu121
CUDA 12.3
NVIDIA Driver 545.23.08
FFMPEG version N-112697-gf084e9b0be (built from sources)

I'm using the following class to decode video frames and convert them to PyTorch tensors:

class VideoDataset(Dataset):
    def __init__(self, video_path, gpu_id=0):
        self.video_path = video_path
        self.gpu_id = gpu_id
        self.decoder = None
        self.nv12_to_yuv = None
        self.yuv420_to_rgb = None
        self.rgb_to_pln = None
        self.cc_ctx = None
        self.width = None
        self.height = None

    def open(self):
        # Init HW decoder, convertor, resizer and color space context:
        self.decoder = nvc.PyNvDecoder(self.video_path, self.gpu_id)
        self.width, self.height = self.decoder.Width(), self.decoder.Height()
        self.nv12_to_yuv = nvc.PySurfaceConverter(
            self.width, self.height, nvc.PixelFormat.NV12, nvc.PixelFormat.YUV420, self.gpu_id
        )
        self.yuv420_to_rgb = nvc.PySurfaceConverter(
            self.width, self.height, nvc.PixelFormat.YUV420, nvc.PixelFormat.RGB, self.gpu_id
        )
        self.rgb_to_pln = nvc.PySurfaceConverter(
            self.width, self.height, nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR, self.gpu_id
        )
        self.cc_ctx = nvc.ColorspaceConversionContext(
            nvc.ColorSpace.BT_601, nvc.ColorRange.MPEG
        )

        return self

    def __len__(self):
        return self.decoder.Numframes()

    def __getitem__(self, idx):
        if idx >= len(self):
            raise StopIteration

        # Open video if it is not opened yet:
        if self.decoder is None:
            self.open()

        # Decode 1 compressed video frame to CUDA memory:
        nv12_surface = self.decoder.DecodeSingleSurface()
        if nv12_surface.Empty():
            print("Can not decode frame")
            return None

        # Convert from NV12 to YUV420
        # This extra step is required because not all NV12 -> RGB conversions
        # implemented in NPP support all color spaces and ranges:
        yuv420 = self.nv12_to_yuv.Execute(nv12_surface, self.cc_ctx)
        if yuv420.Empty():
            print("Can not convert nv12 -> yuv420")
            return None

        # Convert from YUV420 to interleaved RGB:
        rgb24 = self.yuv420_to_rgb.Execute(yuv420, self.cc_ctx)
        if rgb24.Empty():
            print("Can not convert yuv420 -> rgb")
            return None

        # Convert from RGB to planar RGB:
        rgb24_planar = self.rgb_to_pln.Execute(rgb24, self.cc_ctx)
        if rgb24_planar.Empty():
            print("Can not convert rgb -> rgb planar")
            return None

        if rgb24_planar.Format() != nvc.PixelFormat.RGB_PLANAR:
            raise RuntimeError("Surface shall be of RGB_PLANAR pixel format")

        surf_plane = rgb24_planar.PlanePtr()
        img_tensor = pnvc.DptrToTensor(
            surf_plane.GpuMem(),
            surf_plane.Width(),
            surf_plane.Height(),
            surf_plane.Pitch(),
            surf_plane.ElemSize(),
        )
        if img_tensor is None:
            raise RuntimeError("Can not export to tensor.")

        return img_tensor

Then I run a loop, collect two adjacent frames and merge them into a batch, that is passed to the model:

for idx, img_tensor in enumerate(dataset):
    batch.append(img_tensor)
    if len(batch) < 2:
        continue

    batch = torch.cat(batch, 0)
    batch = batch.float() / 255.0

    if torch.equal(batch[0], batch[1]):
        print('Equal', idx)

   preds = model(batch)

To check for correctness I use the torch.equal function. If I'm using the GPU, then torch.equal finds many identical frames (each run of the loop it happens on random frames). When I save such images, they are indeed the same and there are also various artifacts on them (it seems that the GPU memory is being overwritten by something).

I tried using torch.clone, but it didn't help.

But if I run it on the CPU it works fine. If I move the tensor from the GPU to the CPU before calling torch.cat, and then return it to the GPU before calling the model, then it also works correctly.

If I don't use batch mode (only 1 frame per iteration) it works fine again.

I don't use multithreading.

NVIDIA / VideoProcessingFramework

How to properly decode multiple frames and convert them to PyTorch tensor #551