Hi @xiang-zhe
Most web cameras produce video in either mjpeg compressed format or yuv422 raw format. E.g. this is how to check your webcam on Linux (more information is available on this ffmpeg doc page):
v4l2-ctl --list-devices
On my machine with Logitech web camera it outputs:
UVC Camera (usb-0000:00:14.0-13.3):
/dev/video4
/dev/video5
Then check which video formats it supports:
ffmpeg -f v4l2 -list_formats all -i /dev/video4
Produces output:
[video4linux2,v4l2 @ 0x55dd5b6fc6c0] Raw : yuyv422 : YUYV 4:2:2 : 640x480 160x120 176x144 320x176 320x240 352x288 432x240 544x288 640x360 752x416 800x448 800x600 864x480 960x544 960x720 1024x576 1184x656 1280x720 1280x960
[video4linux2,v4l2 @ 0x55dd5b6fc6c0] Compressed: mjpeg : Motion-JPEG : 640x480 160x120 176x144 320x176 320x240 352x288 432x240 544x288 640x360 752x416 800x448 800x600 864x480 960x544 960x720 1024x576 1184x656 1280x720 1280x960
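To actually capture one of these raw modes end to end, a command along these lines should work (the size is just one of the modes listed above):
ffmpeg -f v4l2 -input_format yuyv422 -video_size 640x480 -i /dev/video4 -c:v libx264 output.mp4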
The problem is that mjpeg isn't supported by Nvdec (see the Nvdec support matrix) and raw yuv422 streams are too heavy to send over USB in real time. So you either need a USB camera which outputs video in one of the formats supported by Nvdec, or you use OpenCV as you do now.
And I didn't find detailed documentation about how to use VPF, only some samples on GitHub
Sample scripts illustrate the most typical use cases and show how to use VPF. Unfortunately, I don't have enough bandwidth to maintain documentation, especially when new features are added or bugs are fixed.
is there a method to copy data from GPU to CPU with high efficiency
Yes, there's a PySurfaceDownloader class for that; its usage is shown in SampleDemuxDecode.py.
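A minimal sketch of that path, modeled on the sample (the decoder calls are written from memory of the VPF API, so treat the exact names as approximate):

import numpy as np
import PyNvCodec as nvc

gpu_id = 0
nvDec = nvc.PyNvDecoder("input.mp4", gpu_id)
nvDwn = nvc.PySurfaceDownloader(nvDec.Width(), nvDec.Height(), nvDec.Format(), gpu_id)

# Host-side buffer big enough for one raw frame.
raw_frame = np.ndarray(shape=(nvDec.Framesize()), dtype=np.uint8)

surface = nvDec.DecodeSingleSurface()  # decoded frame stays in GPU memory
if not surface.Empty():
    if nvDwn.DownloadSingleSurface(surface, raw_frame):
        pass  # raw_frame now holds the frame on the CPU (NV12 by default)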
Thanks again, but I have some other questions. I work on Win10.
1. A camera capture card is connected between my camera and my PC (by PCI, not USB). Running ffmpeg -list_devices true -f dshow -i dummy shows:
[dshow @ 0000014e3966d540] DirectShow video devices (some may be both video and audio devices)
[dshow @ 0000014e3966d540] "Game Capture 4K60 Pro MK.2 Video"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\pci#ven_12ab&dev_0710&subsys_000e1cfa&rev_00#4&38ab2860&0&0008#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\{6f814be9-9af6-43cf-9249-c03401000226}"
[dshow @ 0000014e3966d540] "CYP USB Video Device"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\usb#vid_5000&pid_3104&mi_00#6&172a5fb1&0&0000#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\global"
[dshow @ 0000014e3966d540] "OBS-Camera"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{27B05C2D-93DC-474A-A5DA-9BBA34CB2A9C}"
[dshow @ 0000014e3966d540] "OBS-Camera2"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{27B05C2D-93DC-474A-A5DA-9BBA34CB2A9D}"
[dshow @ 0000014e3966d540] "screen-capture-recorder"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{4EA69364-2C8A-4AE6-A561-56E4B5044439}"
[dshow @ 0000014e3966d540] "Camera (NVIDIA Broadcast)"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{7BBFF097-B3FB-4B26-B685-7A998DE7CEAC}"
[dshow @ 0000014e3966d540] "OBS Virtual Camera"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{A3FCE0F5-3493-419F-958A-ABA1250EC20B}"
[dshow @ 0000014e3966d540] "Elgato Screen Link"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{860BB310-5D01-11D0-BD3B-00A0C911CE86}\{D2F41684-D46F-440B-8096-4FCD528ED5A3}"
[dshow @ 0000014e3966d540] DirectShow audio devices
[dshow @ 0000014e3966d540] "立体声混音 (Realtek(R) Audio)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{30E55EA4-BA1D-465B-B217-ACF05E273FAB}"
[dshow @ 0000014e3966d540] "Game Capture 4K60 Pro MK.2 Audio"
[dshow @ 0000014e3966d540] Alternative name "@device_pnp_\\?\pci#ven_12ab&dev_0710&subsys_000e1cfa&rev_00#4&38ab2860&0&0008#{33d9a762-90c8-11d0-bd43-00a0c911ce86}\{6f814be9-9af6-43cf-9249-c03401000326}"
[dshow @ 0000014e3966d540] "virtual-audio-capturer"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\{8E146464-DB61-4309-AFA1-3578E927E935}"
[dshow @ 0000014e3966d540] "OBS-Audio"
[dshow @ 0000014e3966d540] Alternative name "@device_sw_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\{B750E5CD-5E7E-4ED3-B675-A5003C439997}"
[dshow @ 0000014e3966d540] "麦克风 (NVIDIA Broadcast)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{169986C3-9209-41F3-8EDC-BBFA74D73DB8}"
[dshow @ 0000014e3966d540] "麦克风 (CYP USB Audio Device)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{A6529422-92D8-458B-ACE7-6261BAE98487}"
[dshow @ 0000014e3966d540] "麦克风 (2- USB Audio Device)"
[dshow @ 0000014e3966d540] Alternative name "@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\wave_{E760F9FA-F5C5-4A79-8E0A-8CB59546C79C}"
dummy: Immediate exit requested
Running ffmpeg -list_options true -f dshow -i video="Game Capture 4K60 Pro MK.2 Video" shows:
[dshow @ 000001b5c73fd5c0] DirectShow video device options (from video devices)
[dshow @ 000001b5c73fd5c0] Pin "Video Capture" (alternative pin name "0")
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr0 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr0 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] unknown compression type 0x30313050 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] unknown compression type 0x30313050 min s=1920x1080 fps=inf max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=yuyv422 min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=yuv420p min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
[dshow @ 000001b5c73fd5c0] pixel_format=nv12 min s=1920x1080 fps=25 max s=1920x1080 fps=50
[dshow @ 000001b5c73fd5c0] pixel_format=bgr24 min s=1920x1080 fps=29.97 max s=1920x1080 fps=60.0002
And ffprobe -show_format -f dshow -i video="Game Capture 4K60 Pro MK.2 Video" shows:
Input #0, dshow, from 'video=Game Capture 4K60 Pro MK.2 Video':
Duration: N/A, start: 274327.476000, bitrate: N/A
Stream #0:0: Video: rawvideo (YUY2 / 0x32595559), yuyv422, 1920x1080, 50 fps, 50 tbr, 10000k tbn, 10000k tbc
[FORMAT]
filename=video=Game Capture 4K60 Pro MK.2 Video
nb_streams=1
nb_programs=0
format_name=dshow
format_long_name=DirectShow capture
start_time=274327.476000
duration=N/A
size=N/A
bit_rate=N/A
probe_score=25
[/FORMAT]
Does that mean my camera supports those raw formats (yuyv422, yuv420p, nv12, bgr24, bgr0), so I do NOT need to decode, just receive and convert to an RGB array? If so, how can I receive the raw video via VPF (now I use OpenCV)? Or did I pick the wrong video format, or misunderstand the video transmission method?
2. Now I get an array via OpenCV and use nvc.PyFrameUploader to load it to a GPU surface:
nvUpl = nvc.PyFrameUploader(int(w), int(h), nvc.PixelFormat.RGB, gpuID)
surface_tensor = torch.zeros(h, w, 3, dtype=torch.uint8, device=torch.device(f'cuda:{gpuID}'))
rawSurface = nvUpl.UploadSingleFrame(rawFrame) #rawSurface.Format() == nvc.PixelFormat.RGB
then convert it to a torch tensor like this:
rawSurface.PlanePtr().Export(surface_tensor.data_ptr(), w * 3, gpuID)
but I saw another way using pnvc, like this:
# Export to PyTorch tensor
surf_plane = rgb24_planar.PlanePtr()
img_tensor = pnvc.makefromDevicePtrUint8(surf_plane.GpuMem(),
surf_plane.Width(),
surf_plane.Height(),
surf_plane.Pitch(),
surf_plane.ElemSize())
img_tensor.resize_(3, target_h, target_w)
img_tensor = img_tensor.type(dtype=torch.cuda.FloatTensor)
img_tensor = torch.divide(img_tensor, 255.0)
Is there any difference between those two ways, or can I just use either one?
3. After processing with torch, I get a tensor on the GPU. I know there is an nvDwn.DownloadSingleSurface method that can convert a surface to numpy,
is there a method to copy data from GPU to CPU with high efficiency
Yes, there's a PySurfaceDownloader class for that; its usage is shown in SampleDemuxDecode.py.
but I don't know how to convert a torch tensor to a surface. There are some issues like https://github.com/NVIDIA/VideoProcessingFramework/issues/109 and https://github.com/NVIDIA/VideoProcessingFramework/issues/118, but I didn't get useful info from them.
Unfortunately, the reverse procedure isn't implemented and there's no way to convert a PyTorch tensor to a VPF surface (which is a plain CUdeviceptr).
I don't know whether anything has been updated since, or whether I still can't convert a torch tensor on the GPU to a surface directly. Thanks.
Hi @xiang-zhe
so I do NOT need to decode, just receive and convert to an RGB array
The formats you've mentioned are raw image formats; they are not compressed, so you don't need to decode them.
how can I receive the raw video via VPF
If your camera supports raw yuv420p / nv12 / rgb output, you don't need VPF to obtain raw video frames from the device. You can do that just fine with OpenCV because there's nothing a GPU can accelerate here - you just receive an array of pixels over USB.
As soon as you get your video frame as a uint8 numpy array with the help of OpenCV, you may upload it to the GPU using PyFrameUploader and then export it to a torch tensor for NN processing.
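A minimal sketch of that pipeline, reusing the calls quoted elsewhere in this thread (the camera index and frame size are placeholders):

import cv2
import numpy as np
import torch
import PyNvCodec as nvc

gpu_id, w, h = 0, 1920, 1080
cap = cv2.VideoCapture(0)  # placeholder device index

nvUpl = nvc.PyFrameUploader(w, h, nvc.PixelFormat.RGB, gpu_id)
tensor = torch.zeros(h, w, 3, dtype=torch.uint8, device=f'cuda:{gpu_id}')

ok, bgr = cap.read()                        # HxWx3 uint8, BGR order
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # the RGB surface expects RGB order
surface = nvUpl.UploadSingleFrame(np.ascontiguousarray(rgb))
# Device-to-device copy of the surface plane into the tensor's memory.
surface.PlanePtr().Export(tensor.data_ptr(), w * 3, gpu_id)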
There are 2 ways to export a VPF Surface to a PyTorch tensor:
1. The PytorchNvCodec module, which utilizes the PyTorch C++ API for that. It introduces a dependency on the torch module in your Python code.
2. The SurfacePlane.Export method, which is a simple CUDA DtoD memcopy. It doesn't introduce a dependency on torch in your Python code.
Both options are perfectly fine, just use whatever works best for you. One thing to take care of is the pixel format - some NNs expect your tensor to be planar float32 RGB, some prefer interleaved float32 RGB, the normalization range may differ, and so on. You can find some examples of Surface pre-processing in SampleTorchResnet.py and SampleTensorRTResnet.py.
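For instance, a typical pre-processing chain for a network that wants planar float32 input might look like this (a sketch, not taken verbatim from those samples):

import torch

# surface_tensor: HxWx3 uint8 interleaved RGB exported from a VPF Surface.
img = surface_tensor.permute(2, 0, 1).contiguous()  # HWC -> planar CHW
img = img.float().div(255.0)                        # uint8 [0, 255] -> float32 [0, 1]
img = img.unsqueeze(0)                              # add a batch dimension for the NN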
I don't know how to convert a torch tensor to a surface
Please take a look at SamplePyTorch.py:
https://github.com/NVIDIA/VideoProcessingFramework/blob/b896bef16a58e1183bcaa4406bd6b5024e890e50/SamplePyTorch.py#L73-L98
If you just want to download the content of your torch tensor to a numpy array, you don't need VPF for that; use tensor.cpu().numpy() instead.
Hi @rarzumanyan, I was so careless that I didn't see it.
I don't know how to convert a torch tensor to a surface
Please take a look at SamplePyTorch.py:
In my case, there are three time-consuming processes:
1. numpy to GPU: ~0.016s when going through .to("cuda"); now <0.004s using PyFrameUploader
2. processing by the NN: ~0.01s
3. CUDA tensor to numpy: ~0.02s when using tensor.to("cpu")
So I hope to reduce step 3 with VPF (tensor to surface, then to numpy via nvDwn.DownloadSingleSurface) instead of using tensor.cpu().numpy(). I'll try it.
================================= update
I tested it; the code is like:
import PyNvCodec as nvc
from numpy import ndarray as numpy_ndarray, uint8 as numpy_uint8
from torch import zeros as torch_zeros, uint8 as torch_uint8, device as torch_device

class VPF():
    def __init__(self, width, height, gpuID):
        self.w = width
        self.h = height
        self.gpuID = gpuID
        # Pre-allocated GPU tensor that numpy2tensor() exports into.
        self.surface_tensor = torch_zeros(self.h, self.w, 3, dtype=torch_uint8, device=torch_device(f'cuda:{self.gpuID}'))
        # Pre-allocated surface and host buffer that tensor2numpy() goes through.
        self.surface_rgb = nvc.Surface.Make(nvc.PixelFormat.RGB, self.w, self.h, self.gpuID)
        self.frame = numpy_ndarray(shape=(self.surface_rgb.HostSize()), dtype=numpy_uint8)
        self.nvUpl = nvc.PyFrameUploader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)
        self.nvDwn = nvc.PySurfaceDownloader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)

    def numpy2tensor(self, rawFrame):
        rawSurface = self.nvUpl.UploadSingleFrame(rawFrame)  # rawSurface.Format() == nvc.PixelFormat.RGB
        # DtoD copy of the surface plane into the tensor's memory.
        rawSurface.PlanePtr().Export(self.surface_tensor.data_ptr(), self.w * 3, self.gpuID)
        return self.surface_tensor

    def tensor2numpy(self, rawTensor):
        # DtoD copy from the tensor's memory into the surface, then DtoH download.
        self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
        success = self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame)
        if success:
            return self.frame
        print('Failed to download surface')
It's weird: numpy -> surface_rgb -> surface_tensor costs about 0.004s, but surface_tensor -> surface_rgb -> numpy costs about 0.016s (faster than tensor.to("cpu"), which costs more than 0.02s, but the advantage is small compared to numpy2tensor).
I do not understand the underlying C++ mechanism, but I think those two processes (numpy2tensor and tensor2numpy) are symmetric, so why are their costs so different?
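(One caveat worth checking before comparing these numbers, offered as an assumption rather than a finding from this thread: CUDA calls are asynchronous, so a host-side timer can charge still-running NN kernels to whichever downstream call happens to block first. A fairer measurement synchronizes around the timed region:)

import time
import torch

torch.cuda.synchronize()              # drain any pending GPU work first
t0 = time.perf_counter()
result = vpf.tensor2numpy(rawTensor)  # the transfer being measured
torch.cuda.synchronize()              # make sure the copy has really finished
print(f"elapsed: {time.perf_counter() - t0:.4f}s")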
thanks again!
Hi @xiang-zhe
There's no need to guess - you can easily collect a performance profile of your application.
When building VPF from source, opt in to the USE_NVTX option to enable NVTX marker support. Then run your application under the Nsight Systems profiler to see all the CUDA API calls and NVTX ranges on the application timeline.
Select the Python interpreter, the path to your script, and its arguments as the target application. Opt in to the "Collect CUDA trace" and "Collect NVTX trace" options.
You will see all VPF tasks among the CUDA API calls in your app.
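If you prefer the command line, something along these lines should give the same trace (the cmake option name is as above; the nsys flags are standard Nsight Systems CLI usage):
cmake .. -DUSE_NVTX=ON
nsys profile --trace=cuda,nvtx python my_script.py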
P.S. Please pull origin master first - I've added the missing color conversion contexts in SamplePyTorch.py.
Hi @rarzumanyan, it seems I don't need color conversion because I upload numpy RGB to surface_rgb directly. But I have another problem: roughly one broken frame (a snowflake picture) appears about every 10 frames, with no memory leak on CPU or GPU, and the first frame is always broken.
rawFrame = vpf.tensor2numpy(rawTensor[0])
rawFrame = rawFrame.reshape((3, 1080, 1920)).transpose((1, 2, 0))
I counted the broken-frame indices twice. One run:
0, 17, 28, 38, 86, 97, 127
Another run:
0, 10, 20, 50, 71, 91, 102, 112, 132
but when using:
rawFrame = rawTensor.to("cpu").numpy()[0]
it looks normal.
Thanks again.
++++++++++++++++++++++++++++++++ update
It seems related to the torch.jit module: when I comment out the torch.jit lines, most frames are broken but one is fine (the reverse of the case above).
self.model.load_state_dict(torch.load(checkpoint, map_location=device))
#self.model = torch.jit.script(self.model)
#self.model = torch.jit.freeze(self.model)
self.device = device
+++++++++++++++++++++++++++++++++ update
It also seems that tensor2numpy() has some problem. Because the first frame is always broken, I show it in two ways; the code is like:
cv2.show("im",vpf.tensor2numpy(raw_tensor[0][0]))
cv2.waitKey(0)
cv2.show("im", raw_tensor[0][0].to("cpu").numpy())
cv2.waitKey(0)
The first picture is broken and the second is fine.
When I switch to a USB Logitech webcam instead of the PCI Game Capture 4K60 Pro MK.2, all frames are broken with vpf.tensor2numpy(raw_tensor[0][0]) but fine with raw_tensor[0][0].to("cpu").numpy().
And my VPF class is like:
class VPF():
    def __init__(self, width, height, gpuID):
        self.w = width
        self.h = height
        self.gpuID = gpuID
        self.surface_tensor = torch_zeros(self.h, self.w, 3, dtype=torch_uint8, device=torch_device(f'cuda:{self.gpuID}'))
        self.surface_rgb = nvc.Surface.Make(nvc.PixelFormat.RGB, self.w, self.h, self.gpuID)
        self.frame = numpy_ndarray(shape=(self.surface_rgb.HostSize()), dtype=numpy_uint8)
        self.nvUpl = nvc.PyFrameUploader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)
        self.nvDwn = nvc.PySurfaceDownloader(self.w, self.h, nvc.PixelFormat.RGB, self.gpuID)

    def numpy2tensor(self, rawFrame):  # input [H, W, C], output [C, H, W]
        rawSurface = self.nvUpl.UploadSingleFrame(rawFrame)  # rawSurface.Format() == nvc.PixelFormat.RGB
        rawSurface.PlanePtr().Export(self.surface_tensor.data_ptr(), self.w * 3, self.gpuID)
        return self.surface_tensor.permute(2, 0, 1)

    def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
        self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
        success = self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame)
        if success:
            return self.frame.reshape(3, self.h, self.w).transpose(1, 2, 0)
        print('Failed to download surface')

vpf = VPF(width, height, 0)
Any help will be greatly appreciated!
I found a way to fix the broken frames, but it looks stupid: when calling tensor2numpy(), add a line to print(rawTensor):
def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    print(rawTensor)
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
Then the broken frames are gone, but I don't know why. It seems that print(rawTensor[0][0][0]) also works, but print(rawTensor.shape) or print(dir(rawTensor)) does not.
Maybe my code has some problem. If you know where my mistake is, please tell me, thx!
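(A plausible explanation, offered as an assumption rather than something confirmed in this thread: printing a CUDA tensor's values copies it to the host, which implicitly waits for all pending GPU work, so the NN kernels that produce rawTensor finish before Import reads the raw device pointer; print(rawTensor.shape) and print(dir(rawTensor)) only touch metadata and trigger no such wait. If that is the cause, an explicit synchronize should replace the print:)

import torch

def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    # Wait for pending CUDA work (e.g. the NN forward pass that produced
    # rawTensor) before VPF reads the tensor's raw device pointer.
    torch.cuda.synchronize(torch.device(f'cuda:{self.gpuID}'))
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
    if self.nvDwn.DownloadSingleSurface(self.surface_rgb, self.frame):
        return self.frame.reshape(3, self.h, self.w).transpose(1, 2, 0)
    print('Failed to download surface')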
++++++++++++++++++++++++++++++++++++++ update
str(tuple(rawTensor)) also works, like print(rawTensor):
frame = vpf.tensor2numpy(raw_tensor)
cv2.imshow('im', frame)
open("np1.txt", "w").writelines(str(tuple(frame)))
np1.txt looks like:
(array([[ 81, 234, 113],
[ 0, 236, 200],
[ 54, 108, 152],
...,
[ 99, 1, 61],
[141, 71, 148],
[191, 192, 63]], dtype=uint8), array([[152, 10, 19],
[241, 9, 6],
[141, 73, 148],
...,
[200, 1, 90],
[149, 105, 56],
[191, 192, 192]], dtype=uint8), array([[203, 247, 193],
But when I test like this:
def tensor2numpy(self, rawTensor):  # input [C, H, W], output [H, W, C]
    open("np2.txt", "w").writelines(str(tuple(rawTensor)))
    self.surface_rgb.PlanePtr().Import(rawTensor.data_ptr(), self.w * 3, self.gpuID)
np2.txt looks like:
(array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
...,
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]], dtype=uint8), array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
...,
but if the tensor were really all zeros, the picture should be black.
Another problem is that when tensor2numpy is called, the returned array's arrangement is different, although the shape and dtype are the same; so I have to return a flat array and try different reshapes to see which arrangement is correct. This problem may come from my Python code, but I don't understand the C++ side.
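(A guess at the arrangement mismatch, based only on the formats used above: nvc.PixelFormat.RGB stores pixels interleaved (HWC), while the tensor coming out of the NN is planar (CHW), so Import copies planar bytes into a surface that is read back as interleaved. A sketch of handing the surface the layout it expects:)

# rawTensor is CHW on the GPU; the RGB surface is interleaved HWC.
hwc = rawTensor.permute(1, 2, 0).contiguous()  # planar CHW -> interleaved HWC
self.surface_rgb.PlanePtr().Import(hwc.data_ptr(), self.w * 3, self.gpuID)
# After DownloadSingleSurface, the host buffer is then plain HxWx3:
frame = self.frame.reshape(self.h, self.w, 3)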
It's weird that now tensor.to("cpu") and tensor.to("cuda") are as fast as VPF, and the CPU occupancy is also low (15%) rather than 80+%. It makes me crazy.
Thanks for the great job. In my case, I get video frames from a webcam using OpenCV, so I need to copy them to the GPU, which costs a lot; after processing by the model (which in fact costs little), I want the result as a numpy.ndarray, so I copy it back from GPU to CPU. So first, could I use VPF to get video frames from the webcam without host-to-device copies? Second, is there a method to copy data from GPU to CPU with high efficiency?
And I didn't find detailed documentation about how to use VPF, only some samples on GitHub and a brief introduction at
https://developer.nvidia.com/blog/vpf-hardware-accelerated-video-processing-framework-in-python/
Any advice will be appreciated.