NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Enable VideoReader to read variable frame-rate videos via flag #1775

Open lucadealfaro opened 4 years ago

lucadealfaro commented 4 years ago

I have many videos that are almost-constant frame rate. I know in principle I should convert them to constant frame rate e.g. via ffmpeg before reading them, but that's an expensive operation to have in an automated pipeline. As the videos are "almost-constant" frame-rate, is it possible to get VideoReader to disregard the constant-framerate check via a special option/flag? This would be much appreciated.

JanuszL commented 4 years ago

Hi, this option won't help you, as it is not possible to be 100% sure whether a video is CFR or VFR until it has been fully processed. Because of how DALI determines whether a decoded frame is the one it asked the decoder for, VFR videos are not supported in the current design, and processing them leads to a hang. Still, it is a desired feature and we have it on our ToDo list (but we cannot commit to any date).

lucadealfaro commented 4 years ago

Maybe I haven't been clear... I don't want an option that is foolproof. I just want to make VideoReader more tolerant. I understand that if I specify the flag, VideoReader may consider as equally spaced some frames that are not, but at least it would read the file. That is, when the option is specified, perhaps VideoReader could just decode the file into a sequence of frames and not care if the frame rate is constant or not?

JanuszL commented 4 years ago

As I said, it won't work. If frames are not equally spaced, the decoder would just hang, because it would wait for a frame with a given timestamp that doesn't exist in the video. When decoding a video, DALI sends the stream of data between keyframes to the decoder and then waits for the decoded frame with a given timestamp. If that frame doesn't exist (because the video is VFR), it would wait indefinitely.

lucadealfaro commented 4 years ago

I see. Why this logic, rather than sending a stream of data between keyframes to the decoder and then accepting all the frames back, instead of waiting for the ones with a specific timestamp? Can you point me to where exactly this logic is implemented?

Do you happen to know whether the limitation of decoding only video with constant frame rate also exists in the NVIDIA Video Codec SDK?

Sorry for the bother, but recoding videos to ensure constant frame rate is quite an expensive operation...

JanuszL commented 4 years ago

Do you happen to know whether the limitation of decoding only video with constant frame rate also exists in the NVIDIA Video Codec SDK?

That limitation comes only from the way DALI works. NVDEC handles VFR gracefully. Regarding the logic, you can check https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/nvdecoder/nvdecoder.cc#L316. It is not that easy to lift that assumption, because the VideoReader works on frames: it splits the video into frames equally spaced in time, and when asked for the n-th frame it calculates its timestamp from the FPS. In a VFR video, frames with some of those timestamps just don't exist, so the reader would wait indefinitely for such a frame at the mentioned line. In general, we can rephrase the question this way: when you ask the reader for a 10-frame sequence starting from some time point in the video, how should it decide which decoded frames to pick? If the video is CFR, you can easily translate a frame number (index) into the timestamp that the decoder returns. For VFR that is not possible, and a completely different approach is needed.
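To make the index-to-timestamp point concrete, here is a minimal Python sketch. It is not DALI's code: the function names, the 30 FPS rate, the 1/90000 time base and the timestamp lists are all assumptions chosen for illustration. It only models the idea that a CFR reader computes the timestamp it expects and then looks for an exact match among the decoded frames.

```python
from fractions import Fraction


def expected_pts(frame_index, fps, time_base):
    """Timestamp, in time_base ticks, that frame_index would carry at a constant fps."""
    return int(Fraction(frame_index) / Fraction(fps) / Fraction(time_base))


def find_frame(decoded_pts, target_pts):
    """Position of the frame whose timestamp equals target_pts, or None if absent."""
    for position, pts in enumerate(decoded_pts):
        if pts == target_pts:
            return position
    return None


# Illustrative values only (assumptions, not taken from any real file).
fps = Fraction(30)
time_base = Fraction(1, 90000)
cfr_pts = [0, 3000, 6000, 9000]   # evenly spaced timestamps
vfr_pts = [0, 3000, 6100, 9000]   # third frame arrives slightly late

target = expected_pts(2, fps, time_base)
found_in_cfr = find_frame(cfr_pts, target)   # a matching frame exists
found_in_vfr = find_frame(vfr_pts, target)   # no frame carries this timestamp
```

In this toy model the lookup simply returns `None` for the VFR list. A reader that instead keeps waiting for the exact timestamp would never finish, which matches the hang described above.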

lucadealfaro commented 4 years ago

One simple API would be:

When the reader reads a video file, it returns the list of frame timestamps. Then, one is free to figure out for which timestamps to ask for the frames. So:

I will check the code ...

JanuszL commented 4 years ago

When the VideoReader operator is created, it builds a list of all possible sequences. Currently it is enough to open a file and get the frame count; the list is then built from that count and the sequence's length, step and stride. This list is shuffled globally before any sample is selected to be decoded. If we are going to list all frame timestamps, the initialization could take very long, as it requires parsing the whole stream for each file. Maybe we can build some index files, as we do for TFRecord or RecordIO? Also, I don't know if that approach is what people expect from VFR. @a-sansanwal what do you think?
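The list-building step described above can be sketched roughly as follows. This is an illustration in plain Python, not the operator's implementation: `build_sequences` and its arguments are hypothetical names, and it assumes that step is the distance between the starts of consecutive sequences and stride is the distance between consecutive frames inside one sequence.

```python
import random


def build_sequences(frame_counts, sequence_length, step, stride, seed=0):
    """Enumerate (file_index, start_frame) pairs from frame counts alone, then shuffle."""
    # Number of source frames covered by one sequence, first to last frame inclusive.
    span = (sequence_length - 1) * stride + 1
    sequences = []
    for file_index, count in enumerate(frame_counts):
        # Only starts that leave room for a full sequence are kept.
        for start in range(0, count - span + 1, step):
            sequences.append((file_index, start))
    # Global shuffle across all files before anything is decoded.
    random.Random(seed).shuffle(sequences)
    return sequences


# Two hypothetical files with 10 and 5 frames.
sequences = build_sequences([10, 5], sequence_length=3, step=3, stride=2)
```

Note that nothing here needs per-frame timestamps, only the frame count per file, which is why initialization stays cheap in the CFR case.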

a-sansanwal commented 4 years ago

@JanuszL I have a change that enables VFR decoding. It works fine, but it has a few drawbacks.

If we are going to list all frame time stamps the initialization could last very long as it requires to parse the whole stream for each file.

This is one of the drawbacks. If I recall correctly, indexing 1000 videos with 6000 frames each took just about a minute in my testing. My workaround for that was to save the index to disk so that it need not be generated again. I haven't implemented that. The other drawback is that memory consumption goes up, because we have to hold the index for all the frames. I tried to work around that by making it possible to decode VFR videos while indexing only keyframes. This does not seem possible.

@lucadealfaro @JanuszL Another approach is possible. Suppose the user passes a flag hinting to VideoReader that they will not use step/stride/random and just want whatever frames arrive to be decoded in the same order. In that case I think a simple hack is possible that allows decoding VFR videos by disabling just a few checks. Let me know if you're interested in this method.

JanuszL commented 4 years ago

If we are going to list all frame time stamps the initialization could last very long as it requires to parse the whole stream for each file.

This is one of the drawbacks. If I recall correctly, indexing 1000 videos with 6000 frames each took just about a minute in my testing. My workaround for that was to save the index to disk so that it need not be generated again. I haven't implemented that.

We do a similar thing, dumping saved indices to disk, in the COCOReader (the dump_meta_files option). Or we could prepare a custom script just for that (I don't know how easy that is), as we do for TFRecord or RecordIO.
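As a rough illustration of the save-the-index-to-disk idea, here is a minimal Python sketch. It does not reflect how COCOReader or any DALI script stores its metadata: the JSON layout, the `load_or_build_index` name and the `read_pts` callable (standing in for whatever parses the stream and returns frame timestamps) are all assumptions.

```python
import json
import os
import tempfile


def load_or_build_index(video_path, index_path, read_pts):
    """Return frame timestamps for video_path, reusing index_path when it already exists.

    read_pts is a hypothetical callable that parses the whole stream once
    and returns an iterable of frame timestamps.
    """
    if os.path.exists(index_path):
        with open(index_path) as f:
            return json.load(f)
    pts = list(read_pts(video_path))   # the expensive full parse happens only here
    with open(index_path, "w") as f:
        json.dump(pts, f)
    return pts


# Usage with a stand-in parser that records how often it is called.
calls = []

def fake_read_pts(path):
    calls.append(path)
    return [0, 3000, 6100, 9000]

index_file = os.path.join(tempfile.mkdtemp(), "clip.index.json")
first = load_or_build_index("clip.mp4", index_file, fake_read_pts)
second = load_or_build_index("clip.mp4", index_file, fake_read_pts)
```

A real version would also need to notice when the video file has changed since the index was written; this sketch skips that check.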

The other drawback is that memory consumption goes up, because we have to hold the index for all the frames. I tried to work around that by making it possible to decode VFR videos while indexing only keyframes. This does not seem possible.

How much memory do you think it would consume?

@lucadealfaro @JanuszL Another approach is possible. Suppose the user passes a flag hinting to VideoReader that they will not use step/stride/random and just want whatever frames arrive to be decoded in the same order. In that case I think a simple hack is possible that allows decoding VFR videos by disabling just a few checks.

This seems more like an option for the inference, but in this case, user usually wants to read directly from a stream or dynamically provide the files to read - things that DALI doesn't support yet anyway.

a-sansanwal commented 4 years ago

How much memory do you think it would consume?

If I recall correctly, it was around 100 to 200 MB.

lucadealfaro commented 4 years ago

The approach of passing a flag in case one wants to read all frames would be fine for me. The typical use case is of reading all frames to then pass them to something like Optical Flow or similar.

In my slight ignorance, I don't know why it is easier to define step/stride/random based on timestamps rather than simply saying "get me all frames such that frame_idx mod N = K", or even "frame_idx mod N = 0", that is, striding simply on frame indices in the video. But I don't know the underlying code well enough to understand why striding on time is easier or preferable. For videos of constant frame rate, the two would clearly be equivalent, and for videos with variable frame rate, the index-based definition would still work (and it would be sufficient for our use cases).
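The index-based rule proposed here is simple to state in code. The following Python sketch is only an illustration of that rule, with made-up timestamps and a hypothetical `stride_by_index` helper. It is not tied to any DALI API.

```python
def stride_by_index(frames, n, k=0):
    """Keep the frames whose position in decode order satisfies index mod n == k."""
    return [frame for index, frame in enumerate(frames) if index % n == k]


# Timestamps of a hypothetical VFR clip; the spacing is deliberately uneven.
vfr_pts = [0, 3000, 6100, 9000, 12500]

every_second = stride_by_index(vfr_pts, n=2)        # positions 0, 2, 4
odd_positions = stride_by_index(vfr_pts, n=2, k=1)  # positions 1, 3
```

The selection depends only on frame order, so uneven spacing does not matter. The trade-off is that the selected frames are no longer guaranteed to be equally spaced in time.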

JanuszL commented 4 years ago

How much memory do you think it would consume?

If I recall correctly, it was around 100 to 200 MB.

@a-sansanwal I don't think it is that much these days.

In my slight ignorance, I don't know why it is easier to define step/stride/random based on timestamps rather than simply saying "get me all frames such that frame_idx mod N = K", or even "frame_idx mod N = 0", that is, striding simply on frame indices in the video.

I'm not saying it is easier (if I understand correctly what you are saying), but I guess that this randomness could be something that the network expects: you are not feeding the network with sequences of frames in the order of their appearance in the file list, but with some randomness, such as a sequence from video 5, video 1, video 2, again video 1, etc. If you know of classes of problems that don't require such behavior, I would be happy to learn more. Could you also tell us more about the use case/network you are working on? A reference to a research paper would be more than welcome.

lucadealfaro commented 4 years ago

The application is simply that we have short videos taken by webcams that we wish to analyze. The videos are shortish (< 1 minute), but afaik the frames in each video are in time order: in H.264, there are some key frames, and then the other frames are difference-encoded (P frames etc.). We don't have a network involved. These are simply videos recorded by webcams, and the issue is that due to webcam idiosyncrasies, some videos are not produced with constant frame rate (likely due to characteristics of the encoders).

We wish to take one such video, extract the frames to pass them to Optical Flow, then do stuff with the result; we are not interested in doing it in random order or mixing the frames from different video.

JanuszL commented 4 years ago

We wish to take one such video, extract the frames to pass them to Optical Flow, then do stuff with the result; we are not interested in doing it in random order or mixing the frames from different video.

So it is very similar to what usually happens in inference. @a-sansanwal I think that when you add VFR support, we can add an option to do this ordered read. What do you think?

a-sansanwal commented 4 years ago

Yes, adding this ordered read was already on my todo list. I believe it would help even in cases where the user wants to decode CFR videos. Currently we treat each sequence as independent. If the user were to provide a hint, or if we inferred this ourselves from the parameters we receive, we could squeeze out more performance.
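For illustration, an ordered read in the sense discussed here could look like the following Python sketch. This is an assumption about the intended behavior, not the planned DALI implementation: frames are taken in whatever order the decoder delivers them and grouped into consecutive fixed-length sequences, with no timestamp checks. The `ordered_sequences` name and the choice to drop a trailing partial sequence are hypothetical.

```python
def ordered_sequences(frames, sequence_length):
    """Group frames, in arrival order, into consecutive sequences of fixed length."""
    current = []
    for frame in frames:
        current.append(frame)
        if len(current) == sequence_length:
            yield current
            current = []
    # A trailing partial sequence is dropped in this sketch.


# Stand-in for a decoder that yields frames one by one.
decoded = iter(range(7))
sequences = list(ordered_sequences(decoded, sequence_length=3))
```

Because each sequence continues where the previous one ended, a decoder in this mode never has to seek back to a keyframe between sequences, which is presumably where the extra performance would come from.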

DmitryUlyanov commented 4 years ago

Hi @a-sansanwal, any progress on this one?

JanuszL commented 4 years ago

@DmitryUlyanov - nothing yet, but adding more generic support for VFR is on the top of our ToDo list.

dmenig commented 3 years ago

Very interested in this. Has there been any progress?

JanuszL commented 3 years ago

Very interested in this. Has there been any progress?

I'm sorry, but no progress so far.

dwrodri commented 3 years ago

For those of us who'd like to work with VFR videos, what are your thoughts on adding something like a VFRPolicy flag to VideoReader objects where you could choose between one of three policies:

I'm perusing the nvdecoder source now to get a sense of the complexity here. Would this sort of behavior have to be handled within the sequence wrapper around the decoder?

I should clarify: if I can come up with a game plan, I'd be glad to create a PR for this, although it might take some time to test/implement. If this approach seems logical, I could start work on a PR soon.

awolant commented 3 years ago

@dwrodri thanks for your comment. I'm working on adding support for VFR in DALI right now. If I understand your idea correctly, the issue with it is that it works well only if we decode the video from the beginning (1 file = 1 sequence). In the general case, we need to be able to get a sequence from an arbitrary place in the video, regardless of how close the frames we want to decode are to a keyframe. Also, we need to be able to shuffle these sequences. We want to solve the issue of VFR support as comprehensively as we can. Could you tell us more about your use case and expected behavior? This will help with the development of this feature.

dwrodri commented 3 years ago

@dwrodri thanks for your comment. I'm working on adding support for VFR in DALI right now. If I understand your idea correctly, the issue with it is that it works well only if we decode the video from the beginning (1 file = 1 sequence). In the general case, we need to be able to get a sequence from an arbitrary place in the video, regardless of how close the frames we want to decode are to a keyframe. Also, we need to be able to shuffle these sequences. We want to solve the issue of VFR support as comprehensively as we can. Could you tell us more about your use case and expected behavior? This will help with the development of this feature.

We're in the process of evaluating DALI as a potential component of a preprocessing stack for near-real-time video analytics. We're doing inference on GPUs in the Cloud on footage collected on the edge. The footage is stored as files, so no streaming here. When the edge machines are under heavy load, they occasionally drop frames before uploading the footage. I'm a big fan of the way DALI exposes the decoding ASIC on Nvidia GPUs, but our test build currently has to check video for missing frames and correct the footage to constant frame rate.

So, we're processing many short videos, always starting from the beginning of the file, and many of them are occasionally missing a frame.
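The "check video for missing frames" step can be sketched in Python as below. This is not our actual build code: the `find_dropped_frames` name, the nominal frame interval and the timestamps are assumptions for illustration, and obtaining the timestamps from a real file is left out.

```python
def find_dropped_frames(pts, nominal_delta):
    """Return (position, missing_count) for each gap longer than one nominal interval.

    pts is the list of frame timestamps in decode order and nominal_delta is
    the expected distance between consecutive timestamps, in the same units.
    """
    gaps = []
    for position in range(1, len(pts)):
        delta = pts[position] - pts[position - 1]
        missing = round(delta / nominal_delta) - 1
        if missing > 0:
            gaps.append((position, missing))
    return gaps


# Hypothetical clip at a nominal 3000 ticks per frame with one frame dropped.
pts = [0, 3000, 6000, 12000, 15000]
gaps = find_dropped_frames(pts, nominal_delta=3000)
```

An empty result means the clip can be treated as constant frame rate. Otherwise each entry says where frames would have to be duplicated or interpolated to restore it.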

dmenig commented 1 year ago

Has there still been no progress on this? I'm very interested in this too.

JanuszL commented 1 year ago

Hi @hyperfraise,

Please check the experimental video reader. It may lack some of the original video reader's functionality, but it should support VFR videos.