Hi @xiang-zhe
Most web cameras produce video in either mjpeg compressed format or yuv422 raw format. E.g. this is how to check your webcam on Linux (more information is available on this ffmpeg doc page):
v4l2-ctl --list-devices
On my machine with Logitech web camera it outputs:
UVC Camera (usb-0000:00:14.0-13.3):
/dev/video4
/dev/video5
Then check which video formats it supports:
ffmpeg -f v4l2 -list_formats all -i /dev/video4
Produces output:
[video4linux2,v4l2 @ 0x55dd5b6fc6c0] Raw : yuyv422 : YUYV 4:2:2 : 640x480 160x120 176x144 320x176 320x240 352x288 432x240 544x288 640x360 752x416 800x448 800x600 864x480 960x544 960x720 1024x576 1184x656 1280x720 1280x960
[video4linux2,v4l2 @ 0x55dd5b6fc6c0] Compressed: mjpeg : Motion-JPEG : 640x480 160x120 176x144 320x176 320x240 352x288 432x240 544x288 640x360 752x416 800x448 800x600 864x480 960x544 960x720 1024x576 1184x656 1280x720 1280x960
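To actually capture one of these raw modes end to end, a command along these lines should work (the size is just one of the modes listed above):
ffmpeg -f v4l2 -input_format yuyv422 -video_size 640x480 -i /dev/video4 -c:v libx264 output.mp4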
The problem is that mjpeg isn't supported by Nvdec (see the Nvdec support matrix) and raw yuv422 streams are too heavy to send over USB in real time. So you either need a USB camera which outputs video in one of the formats supported by Nvdec, or you use OpenCV as you do now.
And I didn't find detailed documentation about how to use VPF, only some samples on GitHub
Sample scripts illustrate the most typical use cases and show how to use VPF. Unfortunately, I don't have enough bandwidth to maintain documentation, especially when new features are added or bugs are fixed.
is there a method to copy data from GPU to CPU with high efficiency
Yes, there's a PySurfaceDownloader class for that; its usage is shown in SampleDemuxDecode.py.
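A minimal sketch of that path, modeled on the sample (the decoder calls are written from memory of the VPF API, so treat the exact names as approximate):

import numpy as np
import PyNvCodec as nvc

gpu_id = 0
nvDec = nvc.PyNvDecoder("input.mp4", gpu_id)
nvDwn = nvc.PySurfaceDownloader(nvDec.Width(), nvDec.Height(), nvDec.Format(), gpu_id)

# Host-side buffer big enough for one raw frame.
raw_frame = np.ndarray(shape=(nvDec.Framesize()), dtype=np.uint8)

surface = nvDec.DecodeSingleSurface()  # decoded frame stays in GPU memory
if not surface.Empty():
    if nvDwn.DownloadSingleSurface(surface, raw_frame):
        pass  # raw_frame now holds the frame on the CPU (NV12 by default)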
Thanks again, but I have some other questions. I work on Win10.
1. A camera capture card is connected between my camera and my PC (by PCI, not USB). Running ffmpeg -list_devices true -f dshow -i dummy shows:
[dshow @ 0000014e3966d540] DirectShow video devices (some may be both video and audio devices)
[dshow @ 0000014e3966d540] "Game Capture 4K60 Pro MK.2 Video"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\pci#ven_12ab&dev_0710&subsys_000e1cfa&rev_00#4&38ab2860&0&0008#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\{6f814be9-9af6-43cf-9249-c03401000226}"
[dshow @ 0000014e3966d540] "CYP USB Video Device"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\usb#vid_5000&pid_3104&mi_00#6&172a5fb1&0&0000#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\global"
[dshow @ 0000014e3966d540] "OBS-Camera"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{27B05C2D-93DC-474A-A5DA-9BBA34CB2A9C}"
[dshow @ 0000014e3966d540] "OBS-Camera2"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{27B05C2D-93DC-474A-A5DA-9BBA34CB2A9D}"
[dshow @ 0000014e3966d540] "screen-capture-recorder"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{4EA69364-2C8A-4AE6-A561-56E4B5044439}"
[dshow @ 0000014e3966d540] "Camera (NVIDIA Broadcast)"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{7BBFF097-B3FB-4B26-B685-7A998DE7CEAC}"
[dshow @ 0000014e3966d540] "OBS Virtual Camera"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{A3FCE0F5-3493-419F-958A-ABA1250EC20B}"
[dshow @ 0000014e3966d540] "Elgato Screen Link"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{D2F41684-D46F-440B-8096-4FCD528ED5A3}"
[dshow @ 0000014e3966d540] DirectShow audio devices
[dshow @ 0000014e3966d540] "立体声混音 (Realtek(R) Audio)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{30E55EA4-BA1D-465B-B217-ACF05E273FAB}"
[dshow @ 0000014e3966d540] "Game Capture 4K60 Pro MK.2 Audio"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\pci#ven_12ab&dev_0710&subsys_000e1cfa&rev_00#4&38ab2860&0&0008#{33d9a762-90c8-11d0-bd43-00a0c911ce86}\{6f814be9-9af6-43cf-9249-c03401000326}"
[dshow @ 0000014e3966d540] "virtual-audio-capturer"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\{8E146464-DB61-4309-AFA1-3578E927E935}"
[dshow @ 0000014e3966d540] "OBS-Audio"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\{B750E5CD-5E7E-4ED3-B675-A5003C439997}"
[dshow @ 0000014e3966d540] "麦克风 (NVIDIA Broadcast)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{169986C3-9209-41F3-8EDC-BBFA74D73DB8}"
[dshow @ 0000014e3966d540] "麦克风 (CYP USB Audio Device)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{A6529422-92D8-458B-ACE7-6261BAE98487}"
[dshow @ 0000014e3966d540] "麦克风 (2- USB Audio Device)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{E760F9FA-F5C5-4A79-8E0A-8CB59546C79C}"
dummy: Immediate exit requested
Running ffmpeg -list_options true -f dshow -i video="Game Capture 4K60 Pro MK.2 Video" shows:
[dshow @ 000001b5c73fd5c0] DirectShow video device options (from video devices)
[dshow @ 000001b5c73fd5c0] Pin "Video Capture" (alternative pin name "0")
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr0 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr0 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] unknown compression type 0x30313050 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] unknown compression type 0x30313050 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
And ffprobe -show_format -f dshow -i video="Game Capture 4K60 Pro MK.2 Video" shows:
Input #0, dshow, from 'video=Game Capture 4K60 Pro MK.2 Video':
Duration: N/A, start: 274327.476000, bitrate: N/A
Stream #0:0: Video: rawvideo (YUY2 / 0x32595559), yuyv422, 1920x1080, 50 fps, 50 tbr, 10000k tbn, 10000k tbc
[FORMAT]
filename=video=Game Capture 4K60 Pro MK.2 Video
nb_streams=1
nb_programs=0
format_name=dshow
format_long_name=DirectShow capture
start_time=274327.476000
duration=N/A
size=N/A
bit_rate=N/A
probe_score=25
[/FORMAT]
Does that mean my camera supports those raw formats (yuyv422, yuv420p, nv12, bgr24, bgr0), so I do NOT need to decode, just receive and convert to an RGB array? If so, how can I receive the raw video via VPF (now I use OpenCV)? Or did I pick the wrong video format, or misunderstand the video transmission method?
2. Now I get an array via OpenCV and use nvc.PyFrameUploader to load it to a GPU surface:
nvUpl = nvc.PyFrameUploader(int(w), int(h), nvc.PixelFormat.RGB, gpuID)
surface_tensor = torch.zeros(h, w, 3, dtype=torch.uint8, device=torch.device(f'cuda:{gpuID}'))
rawSurface = nvUpl.UploadSingleFrame(rawFrame) #rawSurface.Format() == nvc.PixelFormat.RGB
then convert it to a torch tensor like this:
rawSurface.PlanePtr().Export(surface_tensor.data_ptr(), w * 3, gpuID)
but I saw another way using pnvc, like this:
# Export to PyTorch tensor
surf_plane = rgb24_planar.PlanePtr()
img_tensor = pnvc.makefromDevicePtrUint8(surf_plane.GpuMem(),
surf_plane.Width(),
surf_plane.Height(),
surf_plane.Pitch(),
surf_plane.ElemSize())
img_tensor.resize_(3, target_h, target_w)
img_tensor = img_tensor.type(dtype=torch.cuda.FloatTensor)
img_tensor = torch.divide(img_tensor, 255.0)
Is there any difference between those two ways, or can I just use either one?
3. After processing with torch, I get a tensor on the GPU. I know there is an nvDwn.DownloadSingleSurface method that can convert a surface to numpy,
is there a method to copy data from GPU to CPU with high efficiency
Yes, there's a PySurfaceDownloader class for that; its usage is shown in SampleDemuxDecode.py.
but I don't know how to convert a torch tensor to a surface. There are some issues like https://github.com/NVIDIA/VideoProcessingFramework/issues/109 and https://github.com/NVIDIA/VideoProcessingFramework/issues/118, but I didn't get useful info from them.
Unfortunately, the reverse procedure isn't implemented and there's no way to convert a PyTorch tensor to a VPF surface (which is a plain CUdeviceptr).
I don't know whether anything has been updated since, or whether I still can't convert a torch tensor on the GPU to a surface directly. Thanks.
Hi @xiang-zhe
so I do NOT need to decode, just receive and convert to an RGB array
The formats you've mentioned are raw image formats; they are not compressed, so you don't need to decode them.
how can I receive the raw video via VPF
If your camera supports raw yuv420p / nv12 / rgb output, you don't need VPF to obtain raw video frames from the device. You can do that just fine with OpenCV because there's nothing a GPU can accelerate here - you just receive an array of pixels over USB.
As soon as you get your video frame as a uint8 numpy array with the help of OpenCV, you may upload it to the GPU using PyFrameUploader and then export it to a torch tensor for NN processing.
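A minimal sketch of that pipeline, reusing the calls quoted elsewhere in this thread (the camera index and frame size are placeholders):

import cv2
import numpy as np
import torch
import PyNvCodec as nvc

gpu_id, w, h = 0, 1920, 1080
cap = cv2.VideoCapture(0)  # placeholder device index

nvUpl = nvc.PyFrameUploader(w, h, nvc.PixelFormat.RGB, gpu_id)
tensor = torch.zeros(h, w, 3, dtype=torch.uint8, device=f'cuda:{gpu_id}')

ok, bgr = cap.read()                        # HxWx3 uint8, BGR order
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # the RGB surface expects RGB order
surface = nvUpl.UploadSingleFrame(np.ascontiguousarray(rgb))
# Device-to-device copy of the surface plane into the tensor's memory.
surface.PlanePtr().Export(tensor.data_ptr(), w * 3, gpu_id)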
There are 2 ways to export a VPF Surface to a PyTorch tensor:
1. The PytorchNvCodec module, which utilizes the PyTorch C++ API for that. It introduces a dependency on the torch module in your Python code.
2. The SurfacePlane.Export method, which is a simple CUDA DtoD memcopy. It doesn't introduce a dependency on torch in your Python code.
Both options are perfectly fine, just use whatever works best for you. One thing to take care of is the pixel format - some NNs expect your tensor to be planar float32 RGB, some prefer interleaved float32 RGB, the normalization range may differ, and so on. You can find some examples of Surface pre-processing in SampleTorchResnet.py and SampleTensorRTResnet.py.
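For instance, a typical pre-processing chain for a network that wants planar float32 input might look like this (a sketch, not taken verbatim from those samples):

import torch

# surface_tensor: HxWx3 uint8 interleaved RGB exported from a VPF Surface.
img = surface_tensor.permute(2, 0, 1).contiguous()  # HWC -> planar CHW
img = img.float().div(255.0)                        # uint8 [0, 255] -> float32 [0, 1]
img = img.unsqueeze(0)                              # add a batch dimension for the NN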
I don't know how to convert a torch tensor to a surface
Please take a look at SamplePyTorch.py:
https://github.com/NVIDIA/VideoProcessingFramework/blob/b896bef16a58e1183bcaa4406bd6b5024e890e50/SamplePyTorch.py#L73-L98
If you just want to download the content of your torch tensor to a numpy array, you don't need VPF for that; use tensor.cpu().numpy() instead.
Hi @rarzumanyan, I was so careless that I didn't see it.
I don't know how to convert a torch tensor to a surface
Please take a look at SamplePyTorch.py:
In my case, there are three time-consuming processes:
1. numpy to GPU: ~0.016s when going through .to("cuda"); now <0.004s using PyFrameUploader
2. processing by the NN: ~0.01s
3. CUDA tensor to numpy: ~0.02s when using tensor.to("cpu")
So I hope to reduce step 3 with VPF (tensor to surface, then to numpy via nvDwn.DownloadSingleSurface) instead of using tensor.cpu().numpy(). I'll try it.
================================= update
I tested it; the code is like:
import PyNvCodec as nvc
from numpy import ndarray as numpy_ndarray, uint8 as numpy_uint8
from torch import zeros as torch_zeros, uint8 as torch_uint8, device as torch_device

class VPF():
    def __init__(self, width, height, gpuID):
        self.w = width
        self.h = height
        self.gpuID = gpuID
        # Pre-allocated GPU tensor that numpy2tensor() exports into.
        self.surface_tensor = torch_zeros(self.h, self.w, 3, dtype=torch_uint8, device=torch_device(f'cuda:{self.gpuID}'))
        # Pre-allocated surface and host buffer that tensor2numpy() goes through.
        self.surface_rgb = nvc.Surface.Make(nvc.PixelFormat.RGB, self.w, self.h, self.gpuID)
        self.frame = numpy_ndarray(shape=(self.surface_rgb.HostSize()), dtype=numpy_uint8)
        self.nvUpl = nvc.PyFrameUploader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)
        self.nvDwn = nvc.PySurfaceDownloader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)

    def numpy2tensor(self, rawFrame):
        rawSurface = self.nvUpl.UploadSingleFrame(rawFrame)  # rawSurface.Format() == nvc.PixelFormat.RGB
        # DtoD copy of the surface plane into the tensor's memory.
        rawSurface.PlanePtr().Export(self.surface_tensor.data_ptr(), self.w * 3, self.gpuID)
        return self.surface_tensor

    def tensor2numpy(self, rawTensor):
        # DtoD copy from the tensor's memory into the surface, then DtoH download.
        self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
        success = self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame)
        if success:
            return self.frame
        print('Failed to download surface')
It's weird: numpy -> surface_rgb -> surface_tensor costs about 0.004s, but surface_tensor -> surface_rgb -> numpy costs about 0.016s (faster than tensor.to("cpu"), which costs more than 0.02s, but the advantage is small compared to numpy2tensor).
I do not understand the underlying C++ mechanism, but I think those two processes (numpy2tensor and tensor2numpy) are symmetric, so why are their costs so different?
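(One caveat worth checking before comparing these numbers, offered as an assumption rather than a finding from this thread: CUDA calls are asynchronous, so a host-side timer can charge still-running NN kernels to whichever downstream call happens to block first. A fairer measurement synchronizes around the timed region:)

import time
import torch

torch.cuda.synchronize()              # drain any pending GPU work first
t0 = time.perf_counter()
result = vpf.tensor2numpy(rawTensor)  # the transfer being measured
torch.cuda.synchronize()              # make sure the copy has really finished
print(f"elapsed: {time.perf_counter() - t0:.4f}s")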
thanks again!
Hi @xiang-zhe
There's no need to guess - you can easily collect a performance profile of your application.
When building VPF from source, opt in to the USE_NVTX option to enable NVTX marker support. Then run your application under the Nsight Systems profiler to see all the CUDA API calls and NVTX ranges on the application timeline.
Select the Python interpreter, the path to your script, and its arguments as the target application. Opt in to the "Collect CUDA trace" and "Collect NVTX trace" options.
You will see all VPF tasks among the CUDA API calls in your app.
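If you prefer the command line, something along these lines should give the same trace (the cmake option name is as above; the nsys flags are standard Nsight Systems CLI usage):
cmake .. -DUSE_NVTX=ON
nsys profile --trace=cuda,nvtx python my_script.py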
P.S. Please pull origin master first - I've added the missing color conversion contexts in SamplePyTorch.py.
Hi @rarzumanyan, it seems I don't need color conversion because I upload numpy RGB to surface_rgb directly. But I have another problem: roughly one broken frame (a snowflake picture) appears about every 10 frames, with no memory leak on CPU or GPU, and the first frame is always broken.
rawFrame = vpf.tensor2numpy(rawTensor[0])
rawFrame = rawFrame.reshape((3, 1080, 1920)).transpose((1, 2, 0))
I counted the broken-frame indices twice. One run:
0, 17, 28, 38, 86, 97, 127
Another run:
0, 10, 20, 50, 71, 91, 102, 112, 132
but when using:
rawFrame = rawTensor.to("cpu").numpy()[0]
it looks normal.
Thanks again.
++++++++++++++++++++++++++++++++ update
It seems related to the torch.jit module: when I comment out the torch.jit lines, most frames are broken but one is fine (the reverse of the case above).
self.model.load_state_dict(torch.load(checkpoint, map_location=device))
#self.model = torch.jit.script(self.model)
#self.model = torch.jit.freeze(self.model)
self.device = device
+++++++++++++++++++++++++++++++++ update
It also seems that tensor2numpy() has some problem. Because the first frame is always broken, I show it in two ways; the code is like:
cv2.show("im",vpf.tensor2numpy(raw_tensor[0][0]))
cv2.waitKey(0)
cv2.show("im", raw_tensor[0][0].to("cpu").numpy())
cv2.waitKey(0)
The first picture is broken and the second is fine.
When I switch to a USB Logitech webcam instead of the PCI Game Capture 4K60 Pro MK.2, all frames are broken with vpf.tensor2numpy(raw_tensor[0][0]) but fine with raw_tensor[0][0].to("cpu").numpy().
And my VPF class is like:
class VPF():
    def __init__(self, width, height, gpuID):
        self.w = width
        self.h = height
        self.gpuID = gpuID
        self.surface_tensor = torch_zeros(self.h, self.w, 3, dtype=torch_uint8, device=torch_device(f'cuda:{self.gpuID}'))
        self.surface_rgb = nvc.Surface.Make(nvc.PixelFormat.RGB, self.w, self.h, self.gpuID)
        self.frame = numpy_ndarray(shape=(self.surface_rgb.HostSize()), dtype=numpy_uint8)
        self.nvUpl = nvc.PyFrameUploader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)
        self.nvDwn = nvc.PySurfaceDownloader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)

    def numpy2tensor(self, rawFrame):  # input [H, W, C], output [C, H, W]
        rawSurface = self.nvUpl.UploadSingleFrame(rawFrame)  # rawSurface.Format() == nvc.PixelFormat.RGB
        rawSurface.PlanePtr().Export(self.surface_tensor.data_ptr(), self.w * 3, self.gpuID)
        return self.surface_tensor.permute(2, 0, 1)

    def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
        self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
        success = self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame)
        if success:
            return self.frame.reshape(3, self.h, self.w).transpose(1, 2, 0)
        print('Failed to download surface')

vpf = VPF(width, height, 0)
Any help will be greatly appreciated!
I found a way to fix the broken frames, but it looks stupid: when calling tensor2numpy(), add a line to print(rawTensor):
def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    print(rawTensor)
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
Then the broken frames are gone, but I don't know why. It seems that print(rawTensor[0][0][0]) also works, but print(rawTensor.shape) or print(dir(rawTensor)) does not.
Maybe my code has some problem. If you know where my mistake is, please tell me, thx!
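(A plausible explanation, offered as an assumption rather than something confirmed in this thread: printing a CUDA tensor's values copies it to the host, which implicitly waits for all pending GPU work, so the NN kernels that produce rawTensor finish before Import reads the raw device pointer; print(rawTensor.shape) and print(dir(rawTensor)) only touch metadata and trigger no such wait. If that is the cause, an explicit synchronize should replace the print:)

import torch

def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    # Wait for pending CUDA work (e.g. the NN forward pass that produced
    # rawTensor) before VPF reads the tensor's raw device pointer.
    torch.cuda.synchronize(torch.device(f'cuda:{self.gpuID}'))
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
    if self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame):
        return self.frame.reshape(3, self.h, self.w).transpose(1, 2, 0)
    print('Failed to download surface')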
++++++++++++++++++++++++++++++++++++++ update
str(tuple(rawTensor)) also works, like print(rawTensor):
frame = vpf.tensor2numpy(raw_tensor)
cv2.imshow('im', frame)
open("np1.txt", "w").writelines(str(tuple(frame)))
np1.txt looks like:
(array([[ 81, 234, 113],
[ 0, 236, 200],
[ 54, 108, 152],
...,
[ 99, 1, 61],
[141, 71, 148],
[191, 192, 63]], dtype=uint8), array([[152, 10, 19],
[241, 9, 6],
[141, 73, 148],
...,
[200, 1, 90],
[149, 105, 56],
[191, 192, 192]], dtype=uint8), array([[203, 247, 193],
But when I test like this:
def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    open("np2.txt", "w").writelines(str(tuple(rawTensor)))
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
np2.txt looks like:
(array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
...,
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]], dtype=uint8), array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
...,
but if the tensor were really all zeros, the picture should be black.
Another problem is that when tensor2numpy is called, the returned array's arrangement is different, although the shape and dtype are the same; so I have to return a flat array and try different reshapes to see which arrangement is correct. This problem may come from my Python code, but I don't understand the C++ side.
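(A guess at the arrangement mismatch, based only on the formats used above: nvc.PixelFormat.RGB stores pixels interleaved (HWC), while the tensor coming out of the NN is planar (CHW), so Import copies planar bytes into a surface that is read back as interleaved. A sketch of handing the surface the layout it expects:)

# rawTensor is CHW on the GPU; the RGB surface is interleaved HWC.
hwc = rawTensor.permute(1, 2, 0).contiguous()  # planar CHW -> interleaved HWC
self.surface_rgb.PlanePtr().Import(hwc.data_ptr(), self.w * 3, self.gpuID)
# After DownloadSingleSurface, the host buffer is then plain HxWx3:
frame = self.frame.reshape(self.h, self.w, 3)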
It's weird that now tensor.to("cpu") and tensor.to("cuda") are as fast as VPF, and the CPU occupancy is also low (15%) rather than 80+%. It makes me crazy.
Thanks for the great job. In my case, I get video frames from a webcam using OpenCV, so I need to copy them to the GPU, which costs a lot; after processing by the model (which in fact costs little), I want the result as a numpy.ndarray, so I copy it back from GPU to CPU. So first, could I use VPF to get video frames from the webcam without host-to-device copies? Second, is there a method to copy data from GPU to CPU with high efficiency?
And I didn't find detailed documentation about how to use VPF, only some samples on GitHub and a brief introduction at
https://developer.nvidia.com/blog/vpf-hardware-accelerated-video-processing-framework-in-python/
Any advice will be appreciated.