NVIDIA / VideoProcessingFramework

A set of Python bindings to C++ libraries that provides full HW acceleration for video decoding, encoding, and GPU-accelerated color space and pixel format conversions
Apache License 2.0

Optimize for multiple streams (drop frames, reduce delays, reduce memory usage) #257

Open mfoglio opened 2 years ago

mfoglio commented 2 years ago

I want to decode as many RTSP streams as possible on a single GPU. Since my application cannot process 30 FPS per stream, it wouldn't be a problem if some frames were dropped; I probably won't need more than 5 FPS per stream. I am assuming the workload could be reduced by dropping data at some step of the pipeline that I'm not aware of.

I also need to process the streams in real time. When following the PyTorch tutorial from the wiki I noticed a delay: if I paused my application for a while (e.g. time.sleep(30)) and then resumed it, the pipeline returned frames from 30 seconds ago. I would like the pipeline to always return the most recent frames. I believe this would also reduce memory usage, since older data could be dropped. Memory is particularly important for me because I want to decode many streams.

I only know the high-level details of H.264 decoding. I know that P, B, and I frames mean you cannot simply drop some data and start decoding again without possibly getting corrupted frames. However, I have run into similar issues before with gstreamer on CPU (high CPU usage, more frames decoded than needed, delays, high memory usage) and I came up with a pipeline that reduced the delay (and therefore the memory footprint) while always returning real-time (present) frames.

How can I achieve my goal here? Is there any argument I could pass to the PyNvDecoder? I see it can receive a dict as an argument, but I couldn't find more details. Here's the code that I am using so far; it is basically the PyTorch wiki tutorial:

import torch
import PyNvCodec as nvc
import PytorchNvCodec as pnvc

gpu_id = 0
input_file = "rtsp_stream_url"

nvDec = nvc.PyNvDecoder(input_file, gpu_id)
target_h, target_w = nvDec.Height(), nvDec.Width()

cspace, crange = nvDec.ColorSpace(), nvDec.ColorRange()
if nvc.ColorSpace.UNSPEC == cspace:
    cspace = nvc.ColorSpace.BT_601
if nvc.ColorRange.UDEF == crange:
    crange = nvc.ColorRange.MPEG
cc_ctx = nvc.ColorspaceConversionContext(cspace, crange)

to_rgb = nvc.PySurfaceConverter(nvDec.Width(), nvDec.Height(), nvc.PixelFormat.NV12, nvc.PixelFormat.RGB, gpu_id)
to_planar = nvc.PySurfaceConverter(nvDec.Width(), nvDec.Height(), nvc.PixelFormat.RGB, nvc.PixelFormat.RGB_PLANAR, gpu_id)

while True:
    # Obtain NV12 decoded surface from decoder;
    rawSurface = nvDec.DecodeSingleSurface()
    if rawSurface.Empty():
        break

    rgb_byte = to_rgb.Execute(rawSurface, cc_ctx)
    rgb_planar = to_planar.Execute(rgb_byte, cc_ctx)

    surfPlane = rgb_planar.PlanePtr()
    surface_tensor = pnvc.makefromDevicePtrUint8(
        surfPlane.GpuMem(), surfPlane.Width(), surfPlane.Height(), surfPlane.Pitch(), surfPlane.ElemSize()
    )
    surface_tensor = surface_tensor.reshape(3, target_h, target_w)  # TODO: check that we are not copying data
    surface_tensor = surface_tensor.permute((1, 2, 0))  # TODO: check that we are not copying data
    # DO SLOW STUFF
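
For reference, this is the kind of thing I was hoping to pass through that dict argument (just a sketch: I'm assuming the constructor forwards the dict as FFmpeg demuxer options, and the option names and values here are unverified guesses on my part):

import PyNvCodec as nvc

gpu_id = 0
input_file = "rtsp_stream_url"

# Hypothetical FFmpeg demuxer options forwarded through the dict argument;
# names and values are unverified guesses.
ffmpeg_opts = {
    "rtsp_transport": "tcp",  # assumed to be more robust than UDP here
    "max_delay": "0",         # assumed to reduce demuxer-side buffering
}
nvDec = nvc.PyNvDecoder(input_file, gpu_id, ffmpeg_opts)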

Any hint on where to start would be really appreciated. This project is fantastic!

rarzumanyan commented 2 years ago

@mfoglio

YUVJ420P basically means YUV420P with JPEG colour range (0–255). It's the same as YUV420P in terms of NVDEC settings and should also correspond to nvc.PixelFormat.NV12.
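
In code that would look roughly like this (a sketch, assuming BT601 colour space as in the snippet above; only the colour range changes for yuvj420p):

import PyNvCodec as nvc

# yuvj420p input: NVDEC still produces NV12 surfaces, only the colour
# range switches from MPEG (16-235) to JPEG (0-255).
cc_ctx = nvc.ColorspaceConversionContext(nvc.ColorSpace.BT_601,
                                         nvc.ColorRange.JPEG)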

I'll add this to the sample.

mfoglio commented 2 years ago

Thanks. Last question about video parameters: from this example https://github.com/NVIDIA/VideoProcessingFramework/blob/master/SampleDecodeMultiThread.py it looked like color space and color range are needed to determine the correct pipeline to obtain RGB frames. Is this true, or is it sufficient to know the format (YUVJ420P, YUV420, YUV444, etc.)?

mfoglio commented 2 years ago

I am not sure color range and color space can be accessed from pyav: https://github.com/PyAV-Org/PyAV/pull/686

rarzumanyan commented 2 years ago

Hi @mfoglio

color space and range are needed to determine the correct pipeline to obtain RGB frames

In VPF, 2 different color spaces are supported (bt601, bt709) and 2 color ranges (mpeg, jpeg), which gives 4 possible nv12 > rgb color conversions.

If you provide the converter with wrong parameters, it will do the conversion anyway, but the colors will be slightly off. The pictures in #226 illustrate this case: the same frame converted with different color spaces.

I don't know this from personal experience, but VPF users who run NN live inference often say that color rendition accuracy is an important aspect of inference prediction accuracy. This is the sole reason behind the somewhat over-complicated color conversion API (you can't "just" convert nv12 or yuv420 to rgb).

I'll investigate color space and color range extraction with PyAV. As a plan B we always have ffprobe. Regarding the yuvj420p pixel format, it's a clear indicator of yuv420p + JPEG color range; this is what the FFmpeg pixel format description says.

rarzumanyan commented 2 years ago

Hi @mfoglio

ffprobe can produce JSON output, so parsing is easy and clean. I've replaced PyAV with ffprobe; all necessary stream parameters are now extracted, including color space and color range. These are optional (not all streams have them), so I've used BT601 and MPEG as the default values.

Now there are fewer dependencies and more useful information :)

Please check out SampleDecodeRTSP.py in master ToT.
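
For reference, the probing boils down to something like this (a rough sketch rather than the exact sample code; the returned strings still need to be mapped to the nvc enums, with BT601 / MPEG as fallbacks):

import json
import subprocess

def probe_color_params(url):
    # Ask ffprobe for stream information in JSON form.
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-show_streams", url]
    streams = json.loads(subprocess.check_output(cmd))["streams"]
    video = next(s for s in streams if s["codec_type"] == "video")

    # color_space / color_range are optional, not all streams carry them.
    color_space = video.get("color_space", "bt601")
    color_range = video.get("color_range", "mpeg")  # ffprobe reports e.g. "tv" / "pc"
    return color_space, color_range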

stepstep123 commented 2 years ago

@mfoglio @rarzumanyan we got the same problem with yuvj420p.

We use VPF to decode an RTMP URL and also spawn an FFmpeg sub-process to decode it, but VPF gives a wrong result and FFmpeg gives no result.

rarzumanyan commented 2 years ago

@stepstep123

Please elaborate on that, what is "wrong result"?

mfoglio commented 2 years ago

Hello @rarzumanyan , after several attempts, I think I managed to fix my code and VPF seems to be stable right now. I will try to optimize my code to reduce VRAM usage and I will let you know if I find any issue. Thanks!

rarzumanyan commented 2 years ago

Thanks for the update, @mfoglio

Glad to hear you were able to fix it; please LMK if you need further assistance. Once we resolve this issue, please consider sharing your findings on large-scale RTSP processing, I'm sure it will be extremely helpful to other VPF users.

mfoglio commented 2 years ago

Thanks @rarzumanyan. I have a question regarding optimization. I have a process running my main application, and a few other processes running VPF to decode one video stream each. Now that with VPF we use processes instead of threads, is there any benefit in using CUDA streams with VPF? In other words, when we put VPF decoding of a video stream into a Process instead of a Thread, do we still need to use streams to avoid conflicts across different processes?
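
For context, my per-stream setup is roughly this (a simplified sketch; the worker body and queue plumbing are placeholders, and in practice I move data off the GPU before handing it to the main process):

import multiprocessing as mp
import PyNvCodec as nvc

def decode_worker(url, gpu_id, out_queue):
    # Each worker process creates its own decoder, so it gets its own
    # primary CUDA context on the device.
    nvDec = nvc.PyNvDecoder(url, gpu_id)
    while True:
        surf = nvDec.DecodeSingleSurface()
        if surf.Empty():
            break
        # ... colour conversion / inference prep would go here ...
        # Placeholder: only metadata is sent back, since GPU surfaces
        # cannot be pickled across process boundaries.
        out_queue.put((nvDec.Width(), nvDec.Height()))

if __name__ == "__main__":
    q = mp.Queue(maxsize=8)
    workers = [
        mp.Process(target=decode_worker, args=(url, 0, q), daemon=True)
        for url in ["rtsp_stream_url_1", "rtsp_stream_url_2"]
    ]
    for w in workers:
        w.start()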

rarzumanyan commented 2 years ago

Hi @mfoglio

Honestly I'm running out of depth here.

To my best knowledge, primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect.

I'll look into it and update you when I find something. Meanwhile I can only recommend testing and observing the actual behavior.

stu-github commented 2 years ago

Hi @mfoglio

Honestly I'm running out of depth here.

To my best knowledge, primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect.

I'll look into it and update you when I find something. Meanwhile I can only recommend testing and observing the actual behavior.

Thanks @rarzumanyan .

After decoding the packet in SampleDecodeRTSP.py,

    # Decode
    enc_packet = np.frombuffer(buffer=bits, dtype=np.uint8)
    pkt_data = nvc.PacketData()
    try:
        surf = nvdec.DecodeSurfaceFromPacket(enc_packet, pkt_data)
        if not surf.Empty():
            fd += 1
            # Shifts towards underflow to avoid increasing vRAM consumption.
            if pkt_data.bsl < read_size:
                read_size = pkt_data.bsl
            # Print process ID every second or so.
            fps = int(params['framerate'])
            if not fd % fps:
                print(name)

I save to JPEG like this:

            yuv = to_yuv.Execute(surf, cc2)
            rgb24 = to_rgb.Execute(yuv, cc2)
            rgb24.PlanePtr().Export(surface_tensor.data_ptr(), w * 3, gpu_id)

            # PROCESS YOUR TENSOR HERE.
            # THIS DUMMY PROCESSING WILL JUST MAKE VIDEO FRAMES DARKER.
            dark_frame = torch.floor_divide(surface_tensor, 2)

            pil = Image.fromarray(surface_tensor.cpu().numpy())
            pil.save('output/%d.jpg' % index)

Is it correct?

rarzumanyan commented 2 years ago

I save to jpeg like this Is it correct?

If it works, it's correct ;) If not - inspect raw RGB frame with OpenCV to see what's happening.
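
Something along these lines is enough for a quick visual check (a sketch reusing rgb24, w, h and gpu_id from your snippet; I'm assuming PySurfaceDownloader is available in your build):

import numpy as np
import cv2
import PyNvCodec as nvc

# Pull the RGB surface back to host memory and dump it to disk.
# OpenCV expects BGR, hence the conversion before writing.
nvDwn = nvc.PySurfaceDownloader(w, h, nvc.PixelFormat.RGB, gpu_id)
rgb_frame = np.ndarray(shape=(h * w * 3), dtype=np.uint8)
if nvDwn.DownloadSingleSurface(rgb24, rgb_frame):
    img = rgb_frame.reshape(h, w, 3)
    cv2.imwrite("debug_frame.png", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))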

stu-github commented 2 years ago

I save to jpeg like this Is it correct?

If it works, it's correct ;) If not - inspect raw RGB frame with OpenCV to see what's happening.

It works.

I want to find (or write) more efficient code, but I can't achieve that yet :(

Thank you!

rarzumanyan commented 2 years ago

Hi @stu-github

I want to find (or write) more efficient code, but I can't achieve that yet :(

Could you start a new issue on that topic? This one is getting chunky.

rarzumanyan commented 2 years ago

Hi @mfoglio

What's the current status of this issue? Do you see improvements / can we close it?

jeshels commented 2 years ago

Hi @stu-github , I am waiting for @rarzumanyan to finish up some fixes. I will share the code as soon as we have something more stable working

Hi @mfoglio, it would be very helpful if you can share your code, or some tips you've learned along the way.

Also, I see that PytorchNvCodec.cpp was recently updated to support specifying a CUDA stream* and asynchronous copying when creating a PyTorch tensor. This is useful.

* In the current implementation, torch::full ignores the user-provided CUDA stream and operates on PyTorch's globally set CUDA stream, but this can be fixed.

rarzumanyan commented 2 years ago

Hi @jeshels

In the current implementation, torch::full ignores the user-provided CUDA stream

Thanks for bringing this up. I'm not an expert in torch, so if you find things like that, please feel free to submit an issue.

P.S. As far as I understand the torch C++ tensor creation API, torch::full doesn't accept a CUDA stream as an argument. Am I missing something?

jeshels commented 2 years ago

@rarzumanyan, sure thing 👍

The PyTorch API is different. Instead of accepting a stream as a parameter for every function, one sets the current CUDA stream separately; everything that executes afterwards runs in the context of that CUDA stream until a new one is set. This can be controlled from both Python and C++. As far as I understand, setting a CUDA stream in one thread doesn't affect other threads.

Since I'm a novice in this subject myself, I'm not sure which way is more appropriate for this case.

Note that if you'd like to go with a C++-based solution, the function getStreamFromExternal() may be useful if you're having trouble passing a PyTorch CUDA stream object from Python to C++. However, due to this issue, that approach requires PyTorch >= 1.11.0.
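
For illustration, the Python-side version of that pattern looks roughly like this (a minimal sketch; the tensor size is arbitrary and the work inside the context is just a stand-in for the copy done by PytorchNvCodec):

import torch

# Set the current CUDA stream; ops launched inside the context run on it.
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Stand-in for the tensor creation / async copy discussed above.
    dummy = torch.full((3, 720, 1280), 0, dtype=torch.uint8, device="cuda")

# Make the default stream wait for side_stream before consuming the result.
torch.cuda.current_stream().wait_stream(side_stream)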

pzyang613 commented 2 years ago

@mfoglio, hello, I have the same question as you: I want to drop some frames so that not every frame is decoded. How did you solve this problem? What should I do based on VPF? Thanks a lot.

timsainb commented 12 months ago
  • Build VPF with USE_NVTX option and launch it under Nsight Systems to collect application timeline.

@rarzumanyan can you explain how to do this?