NVIDIA / VideoProcessingFramework

Set of Python bindings to C++ libraries which provide full HW acceleration for video decoding and encoding, as well as GPU-accelerated color space and pixel format conversions
Apache License 2.0

Optimize GPU memory - Nvidia A100 #482

Closed kevinzezel closed 1 year ago

kevinzezel commented 1 year ago

Hi,

How can I change these parameters to optimize memory usage?

Relevant fields of the CUVIDDECODECREATEINFO structure:

- ulNumDecodeSurfaces
- ulNumOutputSurfaces
- DeinterlaceMode
- ulIntraDecodeOnly

https://developer.nvidia.com/blog/optimizing-video-memory-usage-with-the-nvdecode-api-and-nvidia-video-codec-sdk/

My NVIDIA A100 GPU is consuming 500 MB of memory to decode a single 1080p video stream over RTSP. I have a project to process 3,600 cameras, so I'm trying to reduce memory usage as much as possible.

Regards, Kevin

RomanArzumanyan commented 1 year ago

@kevinzezel

Nvdec memory consumption heavily depends on the input. H.264 at higher levels is known for having a very large DPB (16 frames). H.265 should have a smaller footprint. Please provide more details regarding your input streams.
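As a rough back-of-the-envelope illustration of why the DPB size matters (the 1.5 bytes/pixel figure assumes 8-bit NV12 surfaces; real allocations add alignment and parser headroom):

```python
def surface_bytes(width, height):
    # 8-bit NV12: full-resolution luma plane plus half-resolution
    # interleaved chroma plane -> 1.5 bytes per pixel
    return int(width * height * 1.5)

def decoder_surfaces_bytes(width, height, num_surfaces):
    # Total size of the decoder's internal surface pool
    return surface_bytes(width, height) * num_surfaces

# H.264 worst case at 1080p: 16-frame DPB plus a couple of extra surfaces
# (coded height rounded up to 1088)
h264 = decoder_surfaces_bytes(1920, 1088, 16 + 2)
print(f"H.264 1080p worst case: {h264 / 2**20:.0f} MiB")  # ~54 MiB
```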

kevinzezel commented 1 year ago

Thanks for the answer @RomanArzumanyan

Follow my RTSP data:

{'index': 0, 'codec_name': 'hevc', 'codec_long_name': 'H.265 / HEVC (High Efficiency Video Coding)', 'profile': 'Main', 'codec_type': 'video', 'codec_tag_string': '[0][0][0][0]', 'codec_tag': '0x0000', 'width': 1280, 'height': 720, 'coded_width': 1280, 'coded_height': 720, 'closed_captions': 0, 'film_grain': 0, 'has_b_frames': 0, 'pix_fmt': 'yuvj420p', 'level': 93, 'color_range': 'pc', 'color_space': 'bt709', 'color_transfer': 'bt709', 'color_primaries': 'bt709', 'chroma_location': 'left', 'refs': 1, 'r_frame_rate': '90000/1', 'avg_frame_rate': '0/0', 'time_base': '1/90000', 'start_pts': 36000, 'start_time': '0.400000', 'extradata_size': 76, 'disposition': {'default': 0, 'dub': 0, 'original': 0, 'comment': 0, 'lyrics': 0, 'karaoke': 0, 'forced': 0, 'hearing_impaired': 0, 'visual_impaired': 0, 'clean_effects': 0, 'attached_pic': 0, 'timed_thumbnails': 0, 'captions': 0, 'descriptions': 0, 'metadata': 0, 'dependent': 0, 'still_image': 0}}

Is it possible to share the same video decode instance with more than one camera by creating some logic with threading queue?

Regards, Kevin

RomanArzumanyan commented 1 year ago

Hi @kevinzezel To the best of my knowledge, each video is decoded in its own session, and it's impossible to share resources between sessions. If possible, I recommend you do 2 things:

  1. Check memory consumption with SampleDecode from the Video Codec SDK Samples across multiple GPUs to see the possible difference. If the memory consumption differs across GPUs with SampleDecode, it means the differing amount of memory is allocated within the driver itself and not by the user application.
  2. Re-build VPF with the TRACK_TOKEN_ALLOCATIONS CMake option to inspect memory allocations within VPF.
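For step 1, per-process GPU memory can be read from nvidia-smi; the flags below are real nvidia-smi options, while the helper function and the stability of the CSV output are assumptions:

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into a {pid: MiB} dict."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, mib = (field.strip() for field in line.split(","))
        usage[int(pid)] = int(mib)
    return usage

def gpu_mem_by_pid():
    # Query per-process GPU memory usage in MiB, one CSV row per process
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Sampling this before and after creating each decoder instance shows how much each session actually costs.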

Also, I've created a PR which reduces memory usage in VPF according to this article: https://developer.nvidia.com/blog/optimizing-video-memory-usage-with-the-nvdecode-api-and-nvidia-video-codec-sdk/

That said, HEVC already has a moderate DPB size, so you probably won't see much of a difference.

kevinzezel commented 1 year ago

Thanks @RomanArzumanyan,

Could you tell me if 250 MB of VRAM per stream is usual for this type of application?

Each thread I spawn to decode an RTSP stream takes up that much VRAM.

Regards, Kevin

RomanArzumanyan commented 1 year ago

Hi @kevinzezel

It's hard to tell, because when decoding with the Video Codec SDK, one doesn't allocate VRAM explicitly. Instead, the application relies on the built-in video parser, which tells it how much memory is needed. I recommend you check out the latest commit from the master branch; it now has a PR merged whose main aim is memory footprint reduction.

https://github.com/NVIDIA/VideoProcessingFramework/blob/d8d5d1874c65ecfe6a82db2c282182e1b865452e/src/TC/src/NvDecoder.cpp#L166-L168

So VPF will basically allocate as little memory as possible, plus some spare space for better pipelining. If your problem persists, I'd suggest the NVIDIA developer forums as the next place to go.
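As a rough sanity check, much of a per-process figure like 250 MB may be fixed CUDA context overhead rather than decoder surfaces. All the numbers below are illustrative assumptions, not measurements; real values vary by driver and GPU:

```python
# All figures are assumptions for illustration only.
cuda_context_mib = 200                    # primary CUDA context overhead (assumed)
surface_mib = 1280 * 720 * 1.5 / 2 ** 20  # one 720p 8-bit NV12 surface, ~1.3 MiB
num_surfaces = 10                         # assumed HEVC DPB + output/parser headroom
estimate = cuda_context_mib + num_surfaces * surface_mib
print(f"~{estimate:.0f} MiB per process")  # decode surfaces are a small share
```

If that breakdown is anywhere near correct, the biggest saving would come from sharing one CUDA context across streams in a single process, not from shrinking surface counts.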

kevinzezel commented 1 year ago

I will download the new version and test. Thank you very much.

kevinzezel commented 1 year ago

@RomanArzumanyan

I'm trying to write code where I can share the same decoder across more than one stream, since each stream runs at only 5 FPS, thereby further reducing VRAM usage.

I created a single queue for the decoder, which 2 RTSP streams are filling.

It works only while data is coming from a single stream; when the other stream comes in, after a few frames it gives an error.

Note: both streams have the same parameters as the one posted above.

Error:

stream_1
stream_1
stream_1
stream_1
stream_1
stream_1
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
stream_2
Decode Error occurred for picture 44
HW decoder faced error. Re-create instance.

Here is the code to reproduce:

import subprocess
import numpy as np
import pycuda.driver as cuda
import PyNvCodec as nvc
import multiprocessing
from typing import Dict
from io import BytesIO
import json

def get_stream_params(url: str) -> Dict:
    cmd = [
        "ffprobe",
        "-rtsp_transport",
        "tcp",
        "-v",
        "quiet",
        "-print_format",
        "json",
        "-show_format",
        "-show_streams",
        url,
    ]

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    stdout = proc.communicate()[0]

    bio = BytesIO(stdout)
    json_out = json.load(bio)

    params = {}
    if "streams" not in json_out:
        return {}

    for stream in json_out["streams"]:
        print(f'stream: {stream}')
        if stream["codec_type"] == "video":

            params["width"] = stream["width"]
            params["height"] = stream["height"]
            # params["framerate"] = float(eval(stream["avg_frame_rate"]))

            codec_name = stream["codec_name"]
            is_h264 = codec_name == "h264"
            is_hevc = codec_name == "hevc"
            if not is_h264 and not is_hevc:
                raise ValueError(
                    "Unsupported codec: "
                    + codec_name
                    + ". Only H.264 and HEVC are supported in this sample."
                )
            else:
                params["codec"] = (
                    nvc.CudaVideoCodec.H264 if is_h264 else nvc.CudaVideoCodec.HEVC
                )

                pix_fmt = stream["pix_fmt"]
                is_yuv420 = pix_fmt == "yuv420p"
                is_yuv444 = pix_fmt == "yuv444p"

                # YUVJ420P and YUVJ444P are deprecated but still wide spread, so handle
                # them as well. They also indicate JPEG color range.
                is_yuvj420 = pix_fmt == "yuvj420p"
                is_yuvj444 = pix_fmt == "yuvj444p"

                if is_yuvj420:
                    is_yuv420 = True
                    params["color_range"] = nvc.ColorRange.JPEG
                if is_yuvj444:
                    is_yuv444 = True
                    params["color_range"] = nvc.ColorRange.JPEG

                if not is_yuv420 and not is_yuv444:
                    raise ValueError(
                        "Unsupported pixel format: "
                        + pix_fmt
                        + ". Only YUV420 and YUV444 are supported in this sample."
                    )
                else:
                    params["format"] = (
                        nvc.PixelFormat.NV12 if is_yuv420 else nvc.PixelFormat.YUV444
                    )

                # Color range default option. We may have set when parsing
                # pixel format, so check first.
                if "color_range" not in params:
                    params["color_range"] = nvc.ColorRange.MPEG
                # Check actual value.
                if "color_range" in stream:
                    color_range = stream["color_range"]
                    if color_range == "pc" or color_range == "jpeg":
                        params["color_range"] = nvc.ColorRange.JPEG

                # Color space default option:
                params["color_space"] = nvc.ColorSpace.BT_601
                # Check actual value.
                if "color_space" in stream:
                    color_space = stream["color_space"]
                    if color_space == "bt709":
                        params["color_space"] = nvc.ColorSpace.BT_709
                return params
    return {}

def rtsp(queue_rtsp,url,stream_id):
    cmd = ["ffmpeg","-y","-rtsp_transport","tcp","-vsync","0", "-hide_banner", "-i",url,"-c:v","copy","-bsf:v",
        "hevc_mp4toannexb,dump_extra=all","-f","hevc","-"
    ]
    p1 = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    while True:

        bits = p1.stdout.read(4096)
        if len(bits) > 0:
            queue_rtsp.put([bits,stream_id])

urls = [["rtsp://test:test@192.168.0.12:8021","stream_1"],
        ["rtsp://test:test@192.168.0.12:8022","stream_2"]]

params = get_stream_params(urls[0][0])
if not len(params):
    raise ValueError("Can not get streams params")

GPU_ID = 0

w, h = params["width"], params["height"]
f, c, g = params["format"], params["codec"], GPU_ID
cspace, crange = params["color_space"], params["color_range"]

# Create HW decoder class
nvdec = nvc.PyNvDecoder(w, h, f, c, g)

cuda.init()
cuda_ctx = cuda.Device(0).retain_primary_context()
cuda_ctx.push()
cuda_str = cuda.Stream()
cuda_ctx.pop()

nvYuv = nvc.PySurfaceConverter(w, h, f, nvc.PixelFormat.YUV420, cuda_ctx.handle, cuda_str.handle) if cspace != nvc.ColorSpace.BT_709 else None
nvCvt = nvc.PySurfaceConverter(w, h, nvYuv.Format(), nvc.PixelFormat.RGB, cuda_ctx.handle, cuda_str.handle) if nvYuv else nvc.PySurfaceConverter(w, h, f, nvc.PixelFormat.RGB, cuda_ctx.handle, cuda_str.handle)
nvDwn = nvc.PySurfaceDownloader(w, h, nvCvt.Format(), cuda_ctx.handle, cuda_str.handle)
cc_ctx = nvc.ColorspaceConversionContext(cspace, crange)

queue_rtsp = multiprocessing.Queue()
for url, stream_id in urls:
    multiprocessing.Process(target=rtsp,args=(queue_rtsp,url,stream_id,),daemon=True).start()

while True:

    bits, stream_id = queue_rtsp.get()
    enc_packet = np.frombuffer(buffer=bits, dtype=np.uint8)
    pkt_data = nvc.PacketData()

    surf = nvdec.DecodeSurfaceFromPacket(enc_packet, pkt_data)

    print(stream_id)

RomanArzumanyan commented 1 year ago

Hi @kevinzezel

To the best of my knowledge, this approach isn't supported by the Video Codec SDK at the API level. One decoding session corresponds to one bitstream, which means a single PyNvDecoder instance can only decode one input.

> since each stream only has 5 FPS

That doesn't matter at the decoder level, simply because there's no such thing as FPS in video codec standards. You may think of a video stream as a list of frames, and of the decoder as an iterator over that list. FPS, PTS, DTS and other such entities exist at the video container level.

A 5 FPS RTSP decoding session will consume 6 times less Nvdec resources than a 30 FPS RTSP stream, although it will occupy the same amount of VRAM as the 30 FPS stream (if encoded with the same settings).
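Given one decoder per bitstream, the shared-queue code above could instead fan packets out to per-stream queues, each drained by its own PyNvDecoder instance. A minimal sketch of the routing step (the decoder workers themselves are omitted; the function and queue names are illustrative):

```python
import queue
from collections import defaultdict

def route_packets(shared_queue, per_stream_queues, sentinel=None):
    # Fan (bits, stream_id) items from the shared queue out to per-stream
    # queues, so that each stream feeds exactly one decoder instance.
    while True:
        item = shared_queue.get()
        if item is sentinel:
            break
        bits, stream_id = item
        per_stream_queues[stream_id].put(bits)

shared = queue.Queue()
per_stream = defaultdict(queue.Queue)  # one queue (and one decoder) per stream
```

Each decoder worker then blocks on its own queue, so packets from different bitstreams never interleave inside one decoding session.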

kevinzezel commented 1 year ago

Thank you very much!

Regards, Kevin