NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Add an operator for receiving video metadata #5630

Open treasan opened 1 week ago

treasan commented 1 week ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Should have (e.g. Adoption is possible, but the performance shortcomings make the solution inferior).

Please provide a clear description of problem this feature solves

The sample rate (fps) of videos may vary, and hence the time period that a fixed number of frames represents also varies. Having access to the fps, the duration, or even the concrete timesteps of each frame is crucial in many tasks where the actual duration in seconds matters more than the number of frames. For example, I am decoding raw video bytes from a web dataset using the experimental video decoder, and I am forced to fall back on other libraries that can give me this kind of information from the raw video bytes (specifically, PyTorch's VideoReader API).
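To make the point concrete, here is a minimal sketch of how the same number of frames covers a different time span at different frame rates. The helper name `clip_duration_seconds` is hypothetical (it is not a DALI function), and the calculation assumes a constant frame rate.

```python
from fractions import Fraction


def clip_duration_seconds(num_frames: int, fps: Fraction) -> Fraction:
    """Time span covered by num_frames, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return Fraction(num_frames) / fps


# The same 16-frame clip at three common frame rates.
for fps in (Fraction(24), Fraction(30000, 1001), Fraction(60)):
    duration = clip_duration_seconds(16, fps)
    print(f"{float(fps):.3f} fps -> {float(duration):.3f} s")
```

Exact fractions are used so that rates such as 30000/1001 do not accumulate rounding error.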

Feature Description

As a user I want to be able to extract information about the sample rate of a video alongside its decoded frames.

Describe your ideal solution

A new DALI operator that extracts the desired metadata from raw video bytes. An example video decoding pipeline reading from a webdataset (raw video bytes could also come from an external source):

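from nvidia.dali import fn, pipeline_def

@pipeline_def
def pipeline(tar_paths):
    # Read raw (still encoded) video bytes from webdataset archives.
    raw_video = fn.readers.webdataset(tar_paths, ...)
    # Proposed operator; it does not exist in DALI yet.
    duration, fps = fn.get_video_metadata(...)
    video = fn.experimental.decoders.video(raw_video)
    return video, duration, fps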

Describe any alternatives you have considered

No response

Additional context

No response

JanuszL commented 1 week ago

Hi @treasan,

Thank you for reaching out. Yes, that sounds like a good feature to add. Let us add it to our ToDo list. Could you also tell me how you want to use this data further? To drive transformations, or to feed the model?

treasan commented 1 week ago

Hey @JanuszL

I am training a model that expects video snippets of a certain duration (in seconds). Furthermore, it expects a timestep for each frame, which is used for a temporal positional encoding.
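As an illustration of the per-frame timesteps described here, the following sketch derives them from the fps alone. The function `frame_timestamps` is a hypothetical helper, and it assumes a constant frame rate; a variable-frame-rate video would need the real presentation timestamps from the container instead.

```python
def frame_timestamps(num_frames, fps, start_frame=0):
    """Timestamp in seconds of each frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [(start_frame + i) / fps for i in range(num_frames)]


# Four frames of a 2 fps video, starting at the first frame.
print(frame_timestamps(4, 2.0))
# Two frames of a 4 fps video, starting one second in (frame index 4).
print(frame_timestamps(2, 4.0, start_frame=4))
```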

JanuszL commented 1 week ago

Thank you for the clarification. In this case, I think it would be best to return this data directly from the video decoder (at least the timesteps for each frame), and/or to extend the decoder so that it decodes a given duration rather than a given number of frames.

awolant commented 1 week ago

Hello @treasan

Thanks for creating the issue. To better understand the requirement: does your use case expect all samples to have the same number of frames, or does the number of frames vary per sample? If it varies, is that due to variable frame rates in the videos, variable durations in seconds, or both? And if it varies, what type and shape do you expect the output to have in your target framework?

treasan commented 1 week ago

Please have a look at another issue/question I have submitted (#5626). I explain my pipeline there in more detail.

tl;dr:

  1. DALI pipeline: Loading raw video bytes from webdataset
  2. Python function: Peek the duration and fps metadata from the raw video bytes and filter out unwanted videos beforehand (e.g. ones that are too short)
  3. DALI pipeline: Get raw video bytes, duration, fps from external source --> decode video --> return decoded video, duration, fps
  4. Python function: Cut out multiple consecutive snippets of certain duration (e.g. 3 secs) of respective videos based on fps/duration metadata. These snippets constitute one training sample. They get batched and fed to the model alongside their timesteps that were also calculated based on the fps/duration metadata.
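Steps 2 and 4 above could look roughly like the sketch below. Both helpers (`keep_video`, `snippet_frame_ranges`) are hypothetical names, not DALI or PyTorch APIs, and the code assumes a constant frame rate and that the frames-per-snippet count is rounded to the nearest integer.

```python
def keep_video(duration_s, min_duration_s):
    """Step 2: decide from metadata alone whether a video is long enough."""
    return duration_s >= min_duration_s


def snippet_frame_ranges(num_frames, fps, snippet_s, max_snippets):
    """Step 4: (start, stop) frame indices of consecutive fixed-duration snippets."""
    frames_per_snippet = max(1, round(snippet_s * fps))
    # Keep only the snippets that fit completely inside the decoded video.
    count = min(max_snippets, num_frames // frames_per_snippet)
    return [
        (k * frames_per_snippet, (k + 1) * frames_per_snippet)
        for k in range(count)
    ]


# A 10 s video at 30 fps, cut into at most two 3 s snippets.
if keep_video(duration_s=10.0, min_duration_s=3.0):
    for start, stop in snippet_frame_ranges(300, 30.0, 3.0, max_snippets=2):
        print(f"frames [{start}, {stop}) start at {start / 30.0:.2f} s")
```

Each range is half-open, so it can be used directly to slice a decoded frame array.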

So the optimal solution for my use case would be a DALI operator that peeks this metadata from the raw video bytes, since I could then filter videos out before the decoding step (more efficient). This might be similar to the peek_image_shape operator, which gives certain information about an encoded image.