dukebw / lintel

A Python module to decode video frames directly, using the FFmpeg C API.
Apache License 2.0
261 stars 38 forks

Is it possible to use loadvid_frame_nums without reading entire video into memory? #4

Closed jon-barker closed 6 years ago

jon-barker commented 6 years ago

Hi Brendan, I'm trying to test lintel within a PyTorch training loop, but I'm hitting some performance issues due to loading the entire byte array for a .mp4 into memory before passing it to loadvid_frame_nums to pull a sequence of frames. Is there any way to avoid doing this? It's a major bottleneck when sampling large numbers of short random frame sequences.

Obviously in the case where a whole dataset fits in memory you could just load each video once, but very often that's not possible. Thanks!
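For reference, the loading pattern I'm describing looks roughly like this (a sketch; the keyword names follow the README, and frames_from_buffer is just a helper I've added here for the reshape step):

```python
import numpy as np

def frames_from_buffer(decoded, frame_nums, width, height):
    """Reshape lintel's flat RGB byte buffer into (T, H, W, 3)."""
    arr = np.frombuffer(decoded, dtype=np.uint8)
    return arr.reshape((len(frame_nums), height, width, 3))

def sample_frames(path, frame_nums, width, height):
    """Read the whole .mp4 into memory, then decode only frame_nums.

    Reading the entire file for every sample is the bottleneck in question.
    """
    import lintel  # imported here so frames_from_buffer works without lintel

    with open(path, 'rb') as f:
        encoded = f.read()  # entire video in host memory

    decoded = lintel.loadvid_frame_nums(
        encoded, frame_nums=frame_nums, width=width, height=height)
    return frames_from_buffer(decoded, frame_nums, width, height)
```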

dukebw commented 6 years ago

Hi Jon, would you be able to give specific details about the use case you have in mind?

Is the issue just with loadvid_frame_nums, or loadvid as well? It sounds like loadvid is the right API for uniformly sampling random frame sequences. This is what I use for training on the Kinetics dataset, for example.
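Roughly, the loadvid pattern looks like this (a sketch, assuming the signature and the (frames, seek_distance) return tuple shown in the README; the decoder argument is a hypothetical hook added only so the reshape logic can be exercised without a real video):

```python
import numpy as np

def load_random_clip(path, num_frames=32, width=256, height=256, decoder=None):
    """Decode a uniformly random num_frames-long clip from one video file."""
    if decoder is None:
        import lintel
        decoder = lintel.loadvid

    with open(path, 'rb') as f:
        encoded = f.read()

    # should_random_seek=True asks lintel to seek to a random keyframe
    # before decoding, so only num_frames frames are ever decoded.
    decoded, seek_distance = decoder(
        encoded,
        should_random_seek=True,
        width=width,
        height=height,
        num_frames=num_frames)
    clip = np.frombuffer(decoded, dtype=np.uint8)
    return clip.reshape((num_frames, height, width, 3)), seek_distance
```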

I wrote that particular interface (loadvid_frame_nums) for use with the NTU RGB+D dataset, since the dataset has numbered skeleton data associated with specific frame numbers, and a friend requested an API to implement a specific frame sampling scheme he had in mind. The NTU RGB+D videos are all pretty short, around 3MB each. So in that case it's no problem to store all videos in HDF5, select a minibatch of them, and decode out specific frames according to your chosen sampling scheme.

loadvid_frame_nums doesn't do any seeking in the video, so this is definitely a current limitation of that specific API if your videos are relatively large (compared with, say, NTU RGB+D).

Is the issue that your mp4 videos are all individually very large? I guess a simple fix would be to split the videos into smaller chunks beforehand. I would be curious if you know of a more elegant solution.

jon-barker commented 6 years ago

We (NVIDIA) recently released a video data loader too: NVVL. Some folks online have asked how it compares to lintel after seeing your reddit post, so I'm trying to understand that myself.

The use case I have been using for testing is a relatively small dataset of about 60 540p videos ranging from 2MB to 30MB each. In this case I think the most performant way to use lintel is to load the whole dataset into host memory and then sample frames from there. To keep the benefit of using a compressed video format I've been doing this by loading into byte arrays and using _load_frame_nums_to_4darray to sample random frames (with the indices provided by a PyTorch sampler). I could also use _load_video_to_4darray to load into numpy arrays and sample them, but then we'd lose the benefit of a compressed format.
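Concretely, the caching approach I'm testing looks something like this (a sketch; sample_from_cache assumes the loadvid_frame_nums keywords from the README):

```python
import os
import numpy as np

def cache_encoded_videos(video_dir):
    """Read every .mp4 under video_dir into memory once, still compressed."""
    cache = {}
    for name in sorted(os.listdir(video_dir)):
        if name.endswith('.mp4'):
            with open(os.path.join(video_dir, name), 'rb') as f:
                cache[name] = f.read()
    return cache

def sample_from_cache(cache, name, frame_nums, width, height):
    """Decode only the requested frames from the cached byte array."""
    import lintel  # imported here so the caching step works without lintel

    decoded = lintel.loadvid_frame_nums(
        cache[name], frame_nums=frame_nums, width=width, height=height)
    frames = np.frombuffer(decoded, dtype=np.uint8)
    return frames.reshape((len(frame_nums), height, width, 3))
```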

So I have two questions really:

  1. Do you see anything wrong with my approach to using lintel that would prevent me from getting max performance?
  2. How would you handle a dataset that can't fit in host memory? e.g. the NTU RGB+D dataset appears to be 1.3TB - is HDF5 the recommended way to go there?

Thanks for your help!

dukebw commented 6 years ago

Yes NVVL looks awesome. I wish something like NVVL existed a year ago when I started messing around with video datasets. I will learn a lot by going through the NVVL code, for sure.

So I think I understand the confusion. _load_video_to_4darray is named poorly. It should be called _sample_frame_sequence_to_4darray. lintel.loadvid is not called to load the entire video; rather, it is called to extract a frame sequence num_frames long, sampled uniformly from inside the video. I will fix up the README to clarify this.

So to answer your questions:

  1. Use _load_video_to_4darray (hereafter renamed to _sample_frame_sequence_to_4darray) in your PyTorch Dataset object, which subclasses torch.utils.data.Dataset. Call _sample_frame_sequence_to_4darray in __getitem__. This means that for every minibatch, for each example, lintel seeks to a random keyframe in the video and decodes num_frames frames from there. num_frames would tend to be small (if you were going to use the frames as input to a 3D ConvNet or optical flow algorithm), e.g., 32 frames.

  2. I think my answer to 1 also applies here. It is not necessary to read the entire dataset into memory at once, only enough to fill a data queue. I do think that HDF5 is a great way to store serialized data. E.g., what I have done is to first pre-process the entire dataset to resize the videos to the size expected by the input pipeline (e.g., 256x256 for an input pipeline that crops to 224x224 for input to a ResNet). Then I store each of these resized videos as byte arrays in an HDF5 file. This greatly relieves the I/O pressure, since the movies will be quite small (I think the entire Kinetics dataset of 400 000 video clips comes to only 90GB).

Note that with regard to NTU RGB+D, if you ignore the depth and resize the videos, the dataset becomes quite a manageable size (i.e., it could easily fit into the memory of a system with 64GB of DRAM).
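A rough sketch of 1 and 2 together (hypothetical names; it assumes the loadvid return tuple from the README and an HDF5 layout with one dataset of raw mp4 bytes per video key, and in real code the class would subclass torch.utils.data.Dataset rather than just duck-typing its protocol):

```python
import numpy as np

class VideoClipDataset:
    """One random clip per __getitem__, decoded from HDF5-stored byte arrays."""

    def __init__(self, hdf5_path, keys, num_frames=32, width=256, height=256):
        self.hdf5_path = hdf5_path
        self.keys = keys
        self.num_frames = num_frames
        self.width = width
        self.height = height

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # Imported here so the class definition has no hard dependency.
        import h5py
        import lintel

        with h5py.File(self.hdf5_path, 'r') as f:
            encoded = bytes(f[self.keys[index]][...])

        # Seek to a random keyframe, decode num_frames frames from there.
        decoded, _seek_distance = lintel.loadvid(
            encoded,
            should_random_seek=True,
            width=self.width,
            height=self.height,
            num_frames=self.num_frames)
        clip = np.frombuffer(decoded, dtype=np.uint8)
        return clip.reshape((self.num_frames, self.height, self.width, 3))
```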

jaredcasper commented 6 years ago

Thanks for your help Brendan! I work with Jon on NVVL. We got a bit sidetracked this past week, but we still plan to use lintel to understand the performance of using the ffmpeg libs directly. I think you've answered all of our questions. Understanding that loadvid_frame_nums doesn't seek to the nearest keyframe like loadvid does helps explain some of the numbers we're seeing. We were using loadvid_frame_nums so that we could control the randomness, and it seemed like the more generic function to use; in that sense it'd be nice to flesh out loadvid_frame_nums to support random seeking, where the caller determines the frame to seek to.

The use case we built NVVL for internally is large video datasets (TBs of video data), so they wouldn't fit into memory. While I see that it'd be possible to avoid loading the whole dataset at once using the current interface, requiring that a whole file be loaded into memory before parsing would limit the randomness we can achieve, since you'd want to use up the whole video file once you've read it into memory. In NVVL we just use the built-in I/O functionality in ffmpeg to read only the parts of the files we want, instead of loading it all into memory and using a custom AVIOContext. (Although now that I've typed that out, I realize I've never actually checked that that's what ffmpeg is doing :), but I'm fairly sure it is.)

Nice work on the library! It's super helpful.

dukebw commented 6 years ago

Thank you Jared, I think you have a great point about loadvid_frame_nums being the more generic function.

I shall be interested in looking at FFmpeg's built-in I/O functionality in the future, as an addition to the current scheme.

tea1528 commented 5 years ago

Hi Brendan, is it possible for you to share your code for storing a video dataset in an HDF5 file and using lintel to decode it as a byte string? It would be greatly appreciated if you could add some common use case examples for popular datasets like Charades and Kinetics.

Thanks!