Video processing use case

priyamdey commented 2 years ago

Hi. I need to read a sequence of video frames to perform a classification task. I have gone through the documentation of dali.fn.readers.video but I am not able to figure out how to apply it to my use case. Here is what I want to achieve:

1. I have a simple video directory structure: root/{vid1.mkv, vid2.mkv, ...}. Each video has a different duration, but all have a constant fps of, say, 24. I would like to extract the center frame from each chunk of 6 consecutive frames, i.e., 24 frames in a second ⇾ 4 chunks (each of size 6) ⇾ 1 frame from each chunk ⇾ 4 output frames. In other words, every second of video yields 4 frames.
2. It's a multi-label classification problem. If there are C classes in total, any subset of the C classes can occur in a frame. Therefore, each frame has a binary vector of size C as its label. If there are n such frames in a video, the labels for that video form a numpy array of size n x C, which is stored in a numpy file. The label directory structure is: root/{vid1.npy, vid2.npy, ...}.
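To make the chunking arithmetic in point 1 concrete, here is a minimal sketch in plain Python. The function name is made up for illustration, and it assumes that frame 3 (0-based) of each 6-frame chunk counts as the "center", since a chunk of even length has no single middle frame.

```python
def center_frame_indices(num_frames, chunk_size=6):
    """Return the index of the center frame of every complete chunk."""
    # Assumption: for chunk_size=6 the frame at offset 3 (of 0..5) is the center.
    center_offset = chunk_size // 2
    # An incomplete trailing chunk is dropped.
    num_chunks = num_frames // chunk_size
    return [chunk * chunk_size + center_offset for chunk in range(num_chunks)]


# One second of video at 24 fps gives 4 chunks and therefore 4 frames.
print(center_frame_indices(24))  # [3, 9, 15, 21]
```

If the labels hold one row per kept frame, a video with this many kept frames would have a label array with the same number of rows.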

Right now, I'm doing this in 2 parts: first I use ffmpeg to extract frames at 24 fps, then I read the selected frames into CPU memory before loading them onto the GPU. Extracting frames is a one-time process, but reading frames on the CPU to form a long sequence is quite slow. I was thinking of going directly from video to frames on the GPU. Any suggestions / directions would be really helpful.

Thanks!

JanuszL commented 2 years ago

Hi @priyamdey,

Regarding 1), you can ask the reader to read 6 frames per sample with step=6, so that the sequences do not overlap. Then you can use crop or tensor indexing to cut out the middle frame.
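A rough sketch of that suggestion, not a tested pipeline. It assumes DALI is installed with GPU support, that the videos are passed as an explicit list through the filenames argument, and that frame 3 of each 6-frame sample is treated as the middle one. The function name, batch_size, num_threads and device_id values are placeholders.

```python
def build_center_frame_pipeline(filenames, batch_size=4, chunk_size=6):
    """Build a DALI pipeline that returns the middle frame of each chunk."""
    # Imported lazily so this module can be loaded on machines without DALI.
    from nvidia.dali import pipeline_def, fn

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=0)
    def center_frame_pipeline():
        # Each sample is one chunk of chunk_size consecutive frames;
        # step=chunk_size keeps the chunks from overlapping.
        chunk = fn.readers.video(
            device="gpu",
            filenames=filenames,
            sequence_length=chunk_size,
            step=chunk_size,
            name="Reader",
        )
        # Tensor indexing on the frame axis keeps a single frame.
        # Assumption: offset chunk_size // 2 is the "middle" frame.
        return chunk[chunk_size // 2]

    return center_frame_pipeline()
```

Typical usage would be to call build() on the returned pipeline and then run() it. Note that this decodes all 6 frames of every chunk only to keep one of them.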

As for 2), you can ask the video reader to return the frame number and a label for each sequence. The label can be unique for each file. Then, with a label that maps to the file name plus the frame number, you can read the corresponding binary vector of size C in Python (currently this cannot be done efficiently in DALI other than with a Python operator or a custom C++ operator).
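The lookup on the Python side could be sketched as follows. All names here are hypothetical, and two things are assumed: the integer label is the index of the video in a list of file stems, and the label array holds one row per 6-frame chunk (if it holds one row per raw frame, the frame number itself would be the row). Toy lists stand in for the arrays that would normally come from the .npy files.

```python
def lookup_label(video_stems, labels_by_stem, video_id, frame_num, chunk_size=6):
    """Map (integer label from the reader, frame number) to a label vector."""
    stem = video_stems[video_id]
    # Assumption: one label row per chunk, so the row is the chunk index.
    row = frame_num // chunk_size
    return labels_by_stem[stem][row]


# Hypothetical data: position in this list is the integer label of the video.
video_stems = ["vid1", "vid2"]
# In practice each value would be loaded from root/<stem>.npy (shape n x C).
labels_by_stem = {
    "vid1": [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]],
    "vid2": [[0, 0, 0], [1, 0, 1]],
}

print(lookup_label(video_stems, labels_by_stem, video_id=0, frame_num=9))   # [0, 1, 0]
print(lookup_label(video_stems, labels_by_stem, video_id=1, frame_num=6))   # [1, 0, 1]
```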

priyamdey commented 2 years ago

@JanuszL , thanks for the quick response. I didn't quite get you. Please see below:

1. I am a bit unsure if we're on the same page about what a sequence is. For me, a sequence consists of M time-ordered frames, where each frame comes from one chunk, and we pick M consecutive chunks to build it. The sequences can be picked randomly from anywhere across all the videos (a sequence should definitely not cross video boundaries). Did you have the same thing in mind? If yes, can you please give an example of what you're suggesting?
2. How do we ask the video reader to return the frame number and label for a sequence? I'm not sure which flag to set in the API to achieve this.

JanuszL commented 2 years ago

Hi,

Yes, by a sequence I understand a set of consecutive video frames, where stride is the distance between frames and step is the distance between the first frames of the available sequences. So I think we are on the same page.

Please check the video reader documentation. enable_frame_num and enable_timestamps are the ones you are looking for. This example shows how to use them.
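As a hedged illustration of those flags (a sketch based on my reading of the reader documentation, not verified against a specific DALI version): with filenames plus labels and enable_frame_num=True, the reader is expected to return the frames, the integer label of the file, and the number of the first frame of the sequence. The function name and the batch_size, num_threads and device_id values are placeholders.

```python
def build_reader_with_frame_numbers(filenames, sequence_length, batch_size=4):
    """Build a DALI pipeline that also reports where each sequence starts."""
    # Imported lazily so this module can be loaded on machines without DALI.
    from nvidia.dali import pipeline_def, fn

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=0)
    def video_pipeline():
        frames, video_id, first_frame = fn.readers.video(
            device="gpu",
            filenames=filenames,
            labels=[],              # empty list: sequential 0-based ids, one per file
            sequence_length=sequence_length,
            stride=6,               # distance between frames inside a sequence
            step=6,                 # distance between starts of sequences
            enable_frame_num=True,  # adds the frame-number output
            random_shuffle=True,    # sequences drawn from across all videos
            name="Reader",
        )
        return frames, video_id, first_frame

    return video_pipeline()
```

The video_id and first_frame outputs are what a Python-side lookup would use to find the matching rows in the per-video label array.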

priyamdey commented 2 years ago

Thanks for mentioning the API flags for getting the frame numbers. I'll try that. For the sequence part, here's what I understand: as you suggested, we can use stride=6 to have a distance of 6 between 2 consecutive frames in a sequence. Also, we can use step=6 to make sure the starts of any 2 consecutive sequences are 6 frames apart. Now, how can we make sure that sequence formation starts from the middle frame of each chunk and not from the beginning of the chunk? Is there any way to skip the first few frames before starting sequence formation?
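The effect of the start offset can be checked with plain index arithmetic. The sketch below only models the stride/step definitions given in this thread; it is not a reproduction of the reader's internal logic, and the function name is made up.

```python
def sequence_frames(start_offset, num_frames, sequence_length, stride=6, step=6):
    """List the frame indices of every sequence that fits inside one video."""
    sequences = []
    first = start_offset
    # A sequence fits if its last frame is still inside the video.
    while first + (sequence_length - 1) * stride < num_frames:
        sequences.append([first + i * stride for i in range(sequence_length)])
        first += step
    return sequences


# 24 frames, sequences of M=2 frames, starting at the chunk center (offset 3).
print(sequence_frames(3, 24, 2))  # [[3, 9], [9, 15], [15, 21]]
# Starting at offset 0 instead picks the first frame of every chunk.
print(sequence_frames(0, 24, 2))  # [[0, 6], [6, 12], [12, 18]]
```

With stride equal to the chunk size, shifting the first frame by 3 is enough to land every frame of every sequence on a chunk center.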

JanuszL commented 2 years ago

You can use the file_list argument to specify allowed sequences (you probably need to generate it automatically as there are many sequences you work with).

priyamdey commented 2 years ago

Yeah, that's a good way to specify the start frame. So entries of the form "filename label start" in a file, along with stride=6, step=6 and sequence_length=M, should be enough to generate the desired sequences. What should the label be for those entries?

JanuszL commented 2 years ago

You should also specify the end frame; otherwise, all frames between the start and the end of the video could be selected as a sequence start. I would use the label value to identify the file name, so that based on the frame number and the returned label you can easily extract the label vector.
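Generating such a list automatically might look like the sketch below. It assumes the frame count of each video is already known (for example from ffprobe), uses the position of the video in the input as its integer label, and uses a start offset of 3 for the chunk center. The helper name and the sample paths are made up. The start and end values are only treated as frame numbers if the reader is also given file_list_frame_num=True, and whether the end frame is inclusive should be checked in the reader documentation.

```python
def file_list_lines(videos, start_offset=3):
    """Build 'filename label start end' lines for the video reader file_list.

    videos: list of (path, num_frames) pairs; paths must not contain spaces.
    """
    lines = []
    for video_id, (path, num_frames) in enumerate(videos):
        # The label is the index of the video, so it maps back to the file name.
        lines.append(f"{path} {video_id} {start_offset} {num_frames}")
    return lines


# Hypothetical frame counts for two videos.
videos = [("root/vid1.mkv", 240), ("root/vid2.mkv", 100)]
for line in file_list_lines(videos):
    print(line)
# root/vid1.mkv 0 3 240
# root/vid2.mkv 1 3 100
```

The lines would then be written to a text file whose path is passed as file_list, together with stride=6, step=6 and sequence_length=M.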

priyamdey commented 2 years ago

I see. I'll add the end value as well and try this out. Thank you. I'll update here if I get stuck somewhere.

NVIDIA / DALI

Video processing use case #3544