mfoglio opened 2 years ago
@mfoglio
YUVJ420P basically means YUV420P with JPEG color range (0..255). It's the same as YUV420P in terms of NVDEC settings, and should also correspond to nvc.PixelFormat.NV12.
I shall add this to the sample.
Thanks. Last question about video parameters: from this example https://github.com/NVIDIA/VideoProcessingFramework/blob/master/SampleDecodeMultiThread.py it looked like color space and color range are needed to determine the correct pipeline to obtain RGB frames. Is this true, or is it sufficient to know the pixel format (YUVJ420P, YUV420P, YUV444, etc.)?
I am not sure color range and color space can be accessed from PyAV: https://github.com/PyAV-Org/PyAV/pull/686
Hi @mfoglio
color space and range are needed to determine the correct pipeline to obtain RGB frames
In VPF, two different color spaces (bt601, bt709) and two color ranges (mpeg, jpeg) are supported, which gives 4 possible ways of doing the nv12 > rgb color conversion.
If you provide the converter with the wrong parameters, it will do the conversion anyway, but the colors will be slightly off. The pictures below illustrate this case (taken from #226); they were converted with different color spaces:
I don't know from personal experience, but those VPF users who run NN inference on live video often say that color rendition accuracy is an important aspect of prediction accuracy. This is the sole reason behind the over-complicated color conversion API (you can't "just" convert nv12 or yuv420 to rgb).
I'll investigate color space and color range extraction with PyAV. As a plan B we always have ffprobe. Regarding the yuvj420p pixel format, it's a clear indicator of yuv420p + jpeg color range; this is what's said in the FFmpeg pixel format description.
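To make the "slightly off colors" point concrete, here is a small pure-Python illustration (not VPF code; VPF does this on the GPU). It converts the same full-range YUV pixel with the standard BT.601 and BT.709 luma coefficients; the sample pixel values are arbitrary:

```python
# Full-range (JPEG) YUV -> RGB, parameterized by the Kr/Kb luma weights.
def yuv_to_rgb(y, u, v, kr, kb):
    kg = 1.0 - kr - kb
    r = y + 2.0 * (1.0 - kr) * (v - 128)
    b = y + 2.0 * (1.0 - kb) * (u - 128)
    g = y - (2.0 * kb * (1.0 - kb) * (u - 128)
             + 2.0 * kr * (1.0 - kr) * (v - 128)) / kg
    return tuple(max(0, min(255, round(c))) for c in (r, g, b))

pixel = (120, 90, 170)  # an arbitrary Y, U, V triple
bt601 = yuv_to_rgb(*pixel, kr=0.299, kb=0.114)    # BT.601 weights
bt709 = yuv_to_rgb(*pixel, kr=0.2126, kb=0.0722)  # BT.709 weights
print(bt601, bt709)  # (179, 103, 53) (186, 107, 49)
```

The same pixel comes out with visibly different RGB values, which is exactly the "colors are slightly off" effect when the wrong matrix is chosen.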
Hi @mfoglio
ffprobe can actually produce JSON output, so parsing is easy and clean. I've replaced PyAV with ffprobe; all necessary stream parameters are now extracted, including color space and color range. These are optional, as not all streams have them, so I've used BT601 and MPEG as default values.
Now there are fewer dependencies and more useful information :)
Please check out SampleDecodeRTSP.py in master ToT.
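As a sketch of that approach, parsing ffprobe's JSON output with those fallbacks could look like the following (the helper name `get_color_params` and the exact command line are my assumptions, not the sample's actual code):

```python
import json

# Hypothetical helper: pull color space/range out of ffprobe's parsed JSON,
# falling back to bt601/mpeg when the stream doesn't carry the tags.
def get_color_params(probe: dict) -> tuple:
    video = next(
        s for s in probe.get("streams", []) if s.get("codec_type") == "video"
    )
    space = video.get("color_space", "bt601")
    rng = video.get("color_range", "mpeg")
    return space, rng

# The real invocation would be something like:
#   ffprobe -v quiet -print_format json -show_streams <url>
sample = json.loads(
    '{"streams": [{"codec_type": "video", "color_space": "bt709"}]}'
)
print(get_color_params(sample))  # ('bt709', 'mpeg')
```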
@mfoglio @rarzumanyan we got the same problem with yuvj420p.
We use VPF to decode an RTMP URL and also spawn an FFmpeg sub-process to decode it; VPF got a wrong result while FFmpeg got no result.
@stepstep123
Please elaborate on that: what is the "wrong result"?
Hello @rarzumanyan , after several attempts, I think I managed to fix my code and VPF seems to be stable right now. I will try to optimize my code to reduce VRAM usage and I will let you know if I find any issue. Thanks!
Thanks for the update, @mfoglio
Glad to hear you were able to fix it; please LMK if you need further assistance. After we resolve this issue, please consider sharing your findings regarding massive RTSP processing, I'm sure it will be extremely helpful to other VPF users.
Thanks @rarzumanyan. I have a question regarding optimization. I have a process running my main application, and a few other processes running VPF to decode one video stream each. Now that with VPF we use processes instead of threads, is there any benefit in using CUDA streams with VPF? In other words, when we put the VPF decoding of a video stream into a Process instead of a Thread, do we still need to use CUDA streams to avoid conflicts across different processes?
Hi @mfoglio
Honestly, I'm running out of depth here.
To the best of my knowledge, the primary CUDA context is created per device per process, so I don't know the answer to this question right now regarding the thread vs. process aspect.
I'll investigate it and update you when I find something. Meanwhile I can only recommend testing and observing the actual behavior.
Thanks @rarzumanyan.
After decoding the packet in SampleDecodeRTSP.py:
```python
# Decode
enc_packet = np.frombuffer(buffer=bits, dtype=np.uint8)
pkt_data = nvc.PacketData()
try:
    surf = nvdec.DecodeSurfaceFromPacket(enc_packet, pkt_data)
    if not surf.Empty():
        fd += 1
        # Shifts towards underflow to avoid increasing vRAM consumption.
        if pkt_data.bsl < read_size:
            read_size = pkt_data.bsl
        # Print process ID every second or so.
        fps = int(params['framerate'])
        if not fd % fps:
            print(name)
```
I save to JPEG like this:

```python
yuv = to_yuv.Execute(surf, cc2)
rgb24 = to_rgb.Execute(yuv, cc2)
rgb24.PlanePtr().Export(surface_tensor.data_ptr(), w * 3, gpu_id)

# PROCESS YOUR TENSOR HERE.
# THIS DUMMY PROCESSING WILL JUST MAKE VIDEO FRAMES DARKER.
dark_frame = torch.floor_divide(surface_tensor, 2)

pil = Image.fromarray(surface_tensor.cpu().numpy())
pil.save('output/%d.jpg' % index)
```
Is it correct?
I save to jpeg like this Is it correct?
If it works, it's correct ;) If not, inspect the raw RGB frame with OpenCV to see what's happening.
It works.
I want to find (or write) more efficient code, but I can't achieve that right now :(
Thank you!
Hi @stu-github
I want to find(or write) the more efficient code, but I can't achieve now :(
Could you start a new issue on that topic? This one is getting chunky.
Hi @mfoglio
What's the current status of this issue? Do you see improvements / can we close it?
Hi @stu-github, I am waiting for @rarzumanyan to finish up some fixes. I will share the code as soon as we have something more stable working.
Hi @mfoglio, it would be very helpful if you can share your code, or some tips you've learned along the way.
Also, I see that PytorchNvCodec.cpp was recently updated to support specifying a CUDA stream* and asynchronous copying when creating a PyTorch tensor. This is useful.
* In the current implementation, torch::full ignores the user-provided CUDA stream and operates on PyTorch's globally set CUDA stream, but this can be fixed.
Hi @jeshels
In current implementation, the torch::full ignores the user provided CUDA stream
Thanks for bringing this up. I'm not an expert in torch, so if you find things like that please feel free to submit an issue.
P.S. As far as I understand the torch C++ tensor creation API, torch::full doesn't accept a CUDA stream as an argument. Am I missing something?
@rarzumanyan, sure thing 👍
The PyTorch API is different. Instead of accepting a stream as a parameter for every function, one sets the current CUDA stream separately; everything executed afterwards runs in the context of that stream until a new one is set. This can be controlled from both Python and C++. As far as I understand, setting a CUDA stream in one thread doesn't affect other threads.
Since I'm a novice in this subject myself, I'm not sure which way is more appropriate for this case.
Note that if you'd like to go with a C++ based solution, the function getStreamFromExternal() may be useful if you're having trouble passing a PyTorch CUDA stream object from Python to C++. However, note that due to this issue, this solution requires PyTorch version >= 1.11.0.
@mfoglio, hello, I have the same question as you. I want to drop some frames so that not every frame is decoded. How did you solve this problem? What should I do based on VPF? Thanks a lot.
- Build VPF with the USE_NVTX option and launch it under Nsight Systems to collect an application timeline.
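For reference, a build-and-profile session might look roughly like the commands below. These are assumptions based on the advice above (a CMake-style flag named after USE_NVTX and the standard Nsight Systems CLI); check VPF's README for the project's actual build instructions:

```shell
# Configure and build VPF with NVTX ranges enabled (hypothetical flag spelling)
cmake .. -DUSE_NVTX=ON
make -j"$(nproc)"

# Run the sample under Nsight Systems to collect a timeline report
nsys profile -o vpf_timeline python SampleDecodeRTSP.py 0 <rtsp_url>
```

The resulting vpf_timeline report can then be opened in the Nsight Systems GUI to see where the decode pipeline spends its time.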
@rarzumanyan, can you explain how to do this?
I want to decode as many RTSP streams as possible on a single GPU. Since my application is incapable of processing 30 FPS per stream, it wouldn't be an issue if some of the frames were dropped; I probably won't need more than 5 FPS per stream. I am assuming there could be a way to reduce the workload by dropping data at some unknown-to-me step of the pipeline.

I would also need to process the streams in real time. When following the PyTorch tutorial from the wiki I found some kind of delay: if I stopped my application for a while (e.g. time.sleep(30)) and then resumed it, the pipeline was returning frames from 30 seconds ago. I would like the pipeline to always return real-time frames. I believe this would also imply using less memory, since older data could be dropped. Memory is particularly important for me since I want to decode many streams.

I only know the high-level details of H.264 video decoding. I know that P, B, and I frames mean that you cannot simply drop some data and then start decoding without possibly encountering corrupted frames. However, I have encountered similar issues with gstreamer on CPU before (high CPU usage, more frames decoded than needed, delays, and high memory usage), and I came up with a pipeline that was able to reduce delays (therefore also saving memory) while always returning real-time (present) frames.

How can I achieve my goal? Is there any argument I could pass to the PyNvDecoder? I see it can receive a dict as an argument, but I couldn't find more details. Here's the code that I am using so far; it is basically the PyTorch wiki tutorial. Any hint on where to start would be really appreciated. This project is fantastic!