NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Labels for video files with ops.VideoReader #666

Closed: alicranck closed this issue 3 years ago

alicranck commented 5 years ago

Is there any way to return labels with the videos loaded by ops.VideoReader? Maybe composing it with another op in some way? It would be really helpful for video classification tasks (action recognition etc.).

I would be happy to know if something like this exists or is planned. Thanks!

JanuszL commented 5 years ago

Hi, that is an excellent question. Currently, it is not possible. Could you write more about what your data set looks like, how the labels and videos are stored, and whether there are other applications you are considering beyond classification? Our initial goal for the video loader was to support End-to-End Learning of Video Super-Resolution with Motion Compensation, and to see how people want to use it for different cases like yours.

alicranck commented 5 years ago

Hi @JanuszL , Thanks for the reply!

I'm currently using the Kinetics-400 dataset, which has the same structure that ops.FileReader uses (i.e. validation/train folders, each containing 400 folders, 1 per class). But since we pass a list of files to the reader anyway, I think the most flexible and general way to support labels would be to pass a matching list of labels.

Right now I don't have other applications in mind, but I imagine that more people will want to move from image to video processing as the increase in compute capabilities allows it, and this sort of functionality would be useful in that case.

Kh4L commented 5 years ago

Hi @alicranck ,

Indeed, currently VideoReader only supports a list of filenames as input.

Since video datasets such as Kinetics-400 have a structure similar to what FileReader expects, I think we should extend VideoReader to return labels when a file_root argument is provided.

Thanks for the proposition! :+1:

Tracked DALI-614.

willprice commented 5 years ago

This would be super helpful. It'd be great to have a more flexible way of providing labels than just through the filesystem hierarchy. For example, Something-Something labels are provided as JSON, Jester's as a CSV, and EPIC provides them in a CSV or pickle. Ideally, having some way of providing a labelling function given the filename/path of a video would support all the use cases I've described as well as the Kinetics use case, as sketched below.
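
To make the idea concrete, here is a minimal sketch of such a labelling function, assuming the labels live in an external JSON file keyed by video id (the file name and key format are hypothetical, not part of any DALI API):

import json
import os

# Hypothetical annotation file mapping video ids to class names,
# e.g. {"12345": "Pushing something from left to right"}
with open("something-something-labels.json") as f:
    label_map = json.load(f)

def label_fn(video_path):
    # Derive the video id from the file name and look its label up.
    video_id = os.path.splitext(os.path.basename(video_path))[0]
    return label_map[video_id]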

willprice commented 5 years ago

I'm happy to chip in and implement this, although I'm not very familiar with C++. I'm guessing one would want to do something similar to https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/reader/file_reader_op.h#L48 in https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/reader/loader/video_loader.cc#L443,

which would necessitate some wrapper like ImageLabelWrapper, but wrapping a Sequence and a file path. Would the implementation add to SequenceWrapper or wrap SequenceWrapper itself?

Kh4L commented 5 years ago

Hi @willprice ,

Sure, we would be happy to integrate your contribution! :-)

I think adding the labels to frame_starts_ would be the way to go - maybe by simply using std::vector<std::tuple<int, int, int>>, or even better, changing it to a struct SeqMeta for readability.

And then, as you suggested, ReadSample would be:

void VideoLoader::ReadSample(SequenceWrapper& tensor) {
    // TODO(spanev) remove the async between the 2 following methods?
    // frame_starts_ would now hold one entry per sequence, e.g. the proposed
    // struct SeqMeta { int file_idx; int frame_idx; int label; };
    auto& seq_meta = frame_starts_[current_frame_idx_];
    // Schedule the decode of count_ frames starting at the sequence's
    // first frame, then wait for them to arrive in `tensor`.
    push_sequence_to_read(filenames_[seq_meta.file_idx], seq_meta.frame_idx, count_);
    receive_frames(tensor);
    tensor.wait();
    ++current_frame_idx_;

    // Attach the label carried in the sequence metadata to the output.
    tensor.label = seq_meta.label;

    MoveToNextShard(current_frame_idx_);
}

and in https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/reader/video_reader_op.h#L68, you would just have to set Output #1 to tensor.label.

shecker commented 5 years ago

Hi everyone, thanks for this great library. Correct me if I'm wrong, but isn't DALI's VideoReader based on the NVVL project (https://github.com/NVIDIA/nvvl)? In that case, wouldn't it be possible to handle labels just as in NVVL, where you specify an optional callable for the VideoDataset class? Also, would it be possible to perform label-specific image augmentation? For example, I'd like to rotate my image by a label-specified angle. Thanks a lot! Best, Simon

JanuszL commented 5 years ago

@shecker - that is true, the core of the video decoder is based on nvvl, but the logic around it is DALI-specific. A callable may be difficult, as it doesn't fit the current DALI architecture, where labels are just another kind of data in the pipeline. In nvvl, labels cannot be processed; they are loaded and output at the very end of the VideoDataset class. To do it the DALI way, VideoLoader would need to be able to call this Python callback (it is somewhat possible - https://github.com/NVIDIA/DALI/pull/732 - but could be terribly slow). Regarding label-specific augmentations, it seems doable with a custom operator that translates an image into parameters that drive other operators. Again, https://github.com/NVIDIA/DALI/pull/73 could be a solution to that. These are my brief ideas, but we still need more discussion before we can propose anything definite.

keunhong commented 5 years ago

The current architecture seems a bit too rigid. I am currently trying to read video frames together with the associated audio waveforms, but DALI doesn't seem to have a way to return the video frame numbers without writing a custom C++ op.

It would be helpful to have some sort of arbitrary lambda operation support so that unsupported data can be loaded. Even with the performance degradation, I think the accelerated video loading and augmentation would make it worth it. Right now my dataloader's bottleneck is JPEG decoding; DALI would alleviate that, but there doesn't seem to be any way for me to use it.

Kh4L commented 5 years ago

Hi @alicranck ,

I am looking into https://deepmind.com/research/open-source/open-source-datasets/kinetics/ and from what I see, the dataset structure is arbitrary, since they only provide CSV and JSON files containing the scene metadata and YouTube locations.

Do you often see the 1-folder-per-class structure in the literature, or is this just how you chose to organize the dataset in your use case?

willprice commented 5 years ago

I'll chime in as well, @Kh4L. One folder per class is quite common, but so is a flat structure with all examples in a single folder.

alicranck commented 5 years ago

Hi @Kh4L, this is true; this structure comes from the script I used to download the videos. This type of structure is pretty common for classification tasks, and some frameworks have built-in support for it (https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder), but it's definitely not the only use case.

I can think of many applications where the label would be an image/video as well (segmentation), a list of bounding boxes (detection), text (annotation), etc. These are not marginal cases in research today, so I think providing maximum flexibility is important for people to be able to integrate DALI into their projects.

I don't really know what constraints you have to work with, but I imagine that taking an optional list of labels (which could be any objects) to be returned with the corresponding videos would answer most needs in that regard.

Kh4L commented 5 years ago

@willprice @alicranck right, thank you for your input!

Segmentation data support is on the roadmap, and both image and video segmentation labels will be considered as soon as we get to it.

> I don't really know what constraints you have to work with, but I imagine that taking an optional list of labels (which could be any objects) to be returned with the corresponding videos would answer most needs in that regard.

Sure, but how would you parse the content of this "list of labels"? It's quite hard to have a format and a generic parser supporting all the tasks.

alicranck commented 5 years ago

@Kh4L For me, even being able to provide a list of integers, to be used as indices into a list of labels that I keep separately, would be useful; see the sketch below.

This may result in sub-optimal performance, but since you say that support for image and video labels is planned, it may cover most other use cases in a good-enough manner and remove the need to support many different formats.
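
A minimal sketch of that indirection: the reader only ever sees integer labels, and anything richer is looked up on the user's side (the annotation objects below are illustrative, not a DALI API):

# Arbitrary per-video annotations kept outside the pipeline; the integer
# label returned by the reader is just an index into this list.
annotations = [
    {"class": "archery", "boxes": [[10, 20, 50, 60]]},
    {"class": "bowling", "boxes": []},
]

def resolve(label):
    # label: the integer the reader returned for a sequence
    return annotations[int(label)]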

jbohnslav commented 5 years ago

For action recognition, it would be great if we could get per-frame labels, not only video-level labels.

JanuszL commented 5 years ago

@jbohnslav - how do you think the user should pass those labels to the Pipeline itself? Some annotation file?

jbohnslav commented 5 years ago

@JanuszL

Yes, there could either be one annotation file per dataset or (perhaps more simply) one annotation file per video. Here are two example formats. The first is JSON with the start and end of each "action" in seconds, like the ActivityNet dataset:

"---9CpRcKoU": {
            "annotations": [
                {
                    "label": "Drinking beer", 
                    "segment": [
                        0.01000, 
                        12.64441
                    ]
                }
            ], 
            "duration": 14.07000, 
            "resolution": "320x240", 
            "subset": "training", 
            "url": "https://www.youtube.com/watch?v=---9CpRcKoU"
        }, 
        "--0edUL8zmA": {
            "annotations": [
                {
                    "label": "Dodgeball", 
                    "segment": [
                        5.46484, 
                        86.71838
                    ]
                }
            ], 
            "duration": 92.18000, 
            "resolution": "640x480", 
            "subset": "training", 
            "url": "https://www.youtube.com/watch?v=--0edUL8zmA"
        }
...
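
A short sketch of how such a file flattens into per-segment records (the filename is hypothetical; the dict layout is the one shown above):

import json

with open("activity_net_annotations.json") as f:  # hypothetical filename
    db = json.load(f)

segments = []
for video_id, info in db.items():
    for ann in info["annotations"]:
        start, end = ann["segment"]
        segments.append((video_id, ann["label"], start, end))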

For the AVA dataset, it's a .csv file with columns: video, frame, person box (4 coordinates for the bounding box), action_id (an integer denoting the class), and person_id (denoting which person in the frame was doing the action).

A disk-inefficient but easy-to-parse format I find useful: for each video, have a corresponding .csv file with the same number of rows as there are frames in the video. It has either one column with an integer for single-class classification, or N columns of 0s and 1s in the multi-label case, as in the sketch below.
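
A sketch of parsing that per-frame layout, using only the standard library (the path handling is illustrative):

import csv

def load_frame_labels(csv_path):
    with open(csv_path) as f:
        rows = list(csv.reader(f))
    if rows and len(rows[0]) == 1:
        # Single-class: one integer class id per frame.
        return [int(r[0]) for r in rows]
    # Multi-label: one 0/1 indicator per class, per frame.
    return [[int(v) for v in r] for r in rows]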

JanuszL commented 5 years ago

@jbohnslav - understood. Tracked as DALI-890.

ghost commented 5 years ago

@alicranck, @willprice, the initial requirement, for the VideoReader operator to generate and return labels based on the file directory structure or a file_list argument, is implemented via https://github.com/NVIDIA/DALI/pull/1029 and https://github.com/NVIDIA/DALI/pull/998. It will be available in today's nightly build.
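
For anyone landing here later, usage looks roughly like the sketch below (paths and hyperparameters are illustrative; see the video_label_example.py linked further down this thread and the docs for the authoritative API):

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops

class VideoPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, file_root):
        super(VideoPipe, self).__init__(batch_size, num_threads, device_id)
        # With file_root set, VideoReader derives one integer label per
        # subdirectory, mirroring FileReader's one-folder-per-class layout.
        self.reader = ops.VideoReader(device="gpu",
                                      file_root=file_root,
                                      sequence_length=16,
                                      random_shuffle=True,
                                      initial_fill=16)

    def define_graph(self):
        frames, labels = self.reader()
        return frames, labels

pipe = VideoPipe(batch_size=2, num_threads=2, device_id=0,
                 file_root="/data/kinetics400/train")  # hypothetical path
pipe.build()
frames, labels = pipe.run()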

cinjon commented 5 years ago

Hi, I got really excited to see @jbohnslav and @JanuszL's last two comments re ActivityNet-style annotation loading in the VideoReader. However, I don't see how to do this in the examples or in the code of the two PRs that @ArunaUMedhekar mentioned (#1029 and #998). Is there an updated tutorial for this? I'm looking to load ActivityNet-style annotations along with the paired video (and ideally to know which frames are used, so I can further work out which annotation segments are relevant).

Thanks for your help.

JanuszL commented 5 years ago

Hi, currently you can find examples with multiple videos and labels.

cinjon commented 5 years ago

Hi, thanks for the quick reply. I've seen those two links and, as far as I can tell, they don't show a method for including annotations as is done in ActivityNet. I'm probably getting something wrong, but the labels it provides seem to be derived from the directory structure rather than from an external file with segment annotations. Is that right?

JanuszL commented 5 years ago

Hi, in this case you need to write some custom code that correlates the returned classes with the annotations you have and then passes them on to the model. Currently, there is no easy out-of-the-box way. @a-sansanwal any hint?

cinjon commented 5 years ago

Help on this would be really fantastic. @jbohnslav did you figure this out?

suriyachaudary commented 5 years ago

Hi,

I am trying to read N frames (stride of ~10) at a time from a video, along with its label (one label per video). VideoReader with the file_list option throws:

terminate called after throwing an instance of 'dali::DALIException'
  what(): [/opt/dali/dali/pipeline/operators/reader/loader/video_loader.cc:358] 0: failed to seek frame 0

Any help is appreciated. Thanks!

JanuszL commented 5 years ago

> I am trying to read N frames at a time from a video, along with its label (one label per video). VideoReader with the file_list option throws: terminate called after throwing an instance of 'dali::DALIException' what(): [/opt/dali/dali/pipeline/operators/reader/loader/video_loader.cc:358] 0: failed to seek frame 0. Any help is appreciated. Thanks!

@a-sansanwal ?

a-sansanwal commented 5 years ago

@suriyasingh can you upload the video you're trying to read from?

suriyachaudary commented 5 years ago

@a-sansanwal they were videos from the UCF101 dataset, re-encoded with h264_nvenc. This seems to have been fixed in #1287.

cinjon commented 5 years ago

@a-sansanwal Just in case it was lost in the shuffle, did you see my problem as well? Thanks.

a-sansanwal commented 5 years ago

@suriyasingh Recently we also added support for reading directly from UCF-101 without re-encoding via the pull request https://github.com/NVIDIA/DALI/pull/1241. It should be available in the nightly and/or weekly build.

suriyachaudary commented 5 years ago

@a-sansanwal great, thanks! Is there a way to have a dataloader that iterates over one video at a time in a list of many videos? This is especially useful for inference/evaluation. The current loader seems to lack this feature.

a-sansanwal commented 5 years ago

@suriyasingh if you use file_list with labels and random_shuffle=False (the default), the sequences will be in order, and when you detect a label change you can infer that the next video has begun; see the sketch below.
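
To illustrate the trick (a sketch only; it assumes an iterable of (frames, label) pairs in reader order, however you pull them out of the pipeline):

def split_by_video(sequences):
    # Group consecutive sequences by label: with random_shuffle=False,
    # a label change means the reader moved on to the next video.
    current_label, current = None, []
    for frames, label in sequences:
        if current and label != current_label:
            yield current_label, current
            current = []
        current_label = label
        current.append(frames)
    if current:
        yield current_label, current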

JanuszL commented 5 years ago

Hi, you can provide a file_list to the VideoReader with the following format:

filename     label     start_frame    end_frame
file.mp4     0         5              10
file.mp4     1         11             12

Based on those unique labels, you can map the samples that the video reader returns to any piece of information you have in your annotation file(s).

JanuszL commented 5 years ago

@suriyasingh - my bad, I misread the code and this is not possible now. Sorry for the confusion. We have a PoC ready, but there are many open questions about the flexibility.

@a-sansanwal - maybe the VideoReader could return, with labels, two tensors with start and end of a given sequence?

a-sansanwal commented 5 years ago

@cinjon we have a PoC that lets you specify valid start+end timestamps and a label associated with them. But as @JanuszL mentioned, there are open questions about it.

@JanuszL we could return the timestamps of the first and last frame in each sequence. But I imagine it's more friendly to have DALI read the valid timestamps and only return sequences from between them.

JanuszL commented 5 years ago

@a-sansanwal - we can do that. @suriyasingh - do you think it meets your use case?

a-sansanwal commented 5 years ago

@JanuszL it's related to @cinjon's request, not @suriyasingh's. And yes, I can send a PR when I find some time.

cinjon commented 5 years ago

@a-sansanwal What do you mean by valid? Even just returning the timestamps of the first and last frame in the sequence, along with a label identifying which video it came from, would be sufficient. Then I could cross-reference that with a side annotation dict to get what the labels should be. This would be super helpful.

a-sansanwal commented 5 years ago

@cinjon By valid I mean that sequences will only be generated from between the specified start+end timestamps. Frames in a video will not be returned if they do not fall between any of the start+end timestamps provided as input to VideoReader; those frames become invalid. In this case there would be no need to return the timestamps, since a unique label is assigned to each specified clip, and you can then associate the label with an annotation. As an example, the input is of the form:

filename    label    start    end
file.mp4    1        5.0      10.0
file.mp4    2        15.0     20.0

Returning the first and last timestamps is trivial too: we already know the frame number of the first frame of each sequence, the number of frames in the sequence, and the frame rate of the videos. We just need to multiply the frame number by 1/fps and return that from VideoReader.
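
In code form, roughly (a sketch of the arithmetic above; the function name and the stride handling are illustrative):

def sequence_timestamps(first_frame, sequence_length, stride, fps):
    # Timestamp of a frame = frame number * (1 / fps).
    t_start = first_frame / fps
    t_end = (first_frame + (sequence_length - 1) * stride) / fps
    return t_start, t_end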

cinjon commented 5 years ago

I see. That's not necessary for me as my model needs to get inputs from all parts of the video.

The ideal for me is that I can select a number of frames N, an fps F, and some way of specifying the allowed starting frames. By the latter, I don't really need to provide a list; it can be as simple as every Kth frame starting from zero (that's how I am doing it when using images).

This would then yield a batch of <filename, starting_timestamp, frames>, where the assumption is that len(frames) = N, the first frame is from time starting_timestamp, and the last frame is from starting_timestamp + N/F. From this batch, I can figure out the annotations, because I have the filename key as well as the window in that video the frames came from.

Is something like this on the roadmap / is there a PR somewhere that is near completion? That would be super.

a-sansanwal commented 5 years ago

> It can be as simple as every Kth frame starting from zero (that's how I am doing it when using images).

VideoReader supports step and stride parameters; check https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#videoreader. Roughly, the usage is as below.
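
A sketch of those two parameters (the values and file_list path are illustrative; see the linked docs for the exact semantics):

import nvidia.dali.ops as ops

reader = ops.VideoReader(device="gpu",
                         file_list="file_list.txt",  # hypothetical "path label" lines
                         sequence_length=16,
                         step=30,    # frames between the starts of consecutive sequences
                         stride=2)   # frames between consecutive frames within a sequence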

cinjon commented 5 years ago

Splendid, those are super useful. So then I should be set as long as I can get the filename back. Is the way to do that through the use of the label like in this demo (https://github.com/NVIDIA/DALI/blob/b3e406bd7b454c8afaf5aa2d0156e1f8774df48c/docs/examples/video/video_label_example.py)?

cinjon commented 5 years ago

Ok, I have this almost working now. The remaining difficulty is knowing which frame numbers the data corresponds to. All the videos have the same FPS, and I can figure out the per-video information by using the label obtained from passing in a file_list of </path/to/sorted/video.mp4 num_sorted_video>. However, I can't figure out the correct label for the returned frames unless I know at least the start frame (and preferably the end frame, as verification). @a-sansanwal, is this on the agenda? Or is there already a way to do it?

For reference, my workflow is that I have a PyTorch data loader DL that defines a series of VideoReader pipes (one per GPU) and uses those to process the videos. I can execute pipes.run() in DL's __getitem__, but I need that to match up with the index DL receives, and there's no guarantee that it does. If I could pass the index in to the pipes, that would work, because then I could get back the right frames for each __getitem__ call.

Another approach would be to sidestep the PyTorch dataloader completely and just use the Dali video reader. In that case, the index wouldn't matter, but I still need to get the start and end frame number to align the returned frames with the right data.

Thanks!

(It appears that #753 is also about this.)

a-sansanwal commented 5 years ago

@cinjon I have a change that allows you to get starting frame numbers; you can add the sequence length to get the end frame number. Link. I don't plan to send it as a pull request in its current form. I also haven't tested it recently, so if it doesn't work as-is, it might need minimal changes.

cinjon commented 5 years ago

Awesome! I'll get on building locally with that adjustment. Thanks @a-sansanwal, will report back.

cinjon commented 5 years ago

I tried building by pulling and then applying your commit, but ran into an error. Have you seen this before?

[ 43%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/resampling_test/resampling_compare_test.cc.o
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
CMake Error at dali_operators_generated_expression_impl_factory_gpu.cu.o.Release.cmake:279 (message):
  Error generating file
  /opt/dali/build-docker-Release-36-10_x86_64/dali/operators/CMakeFiles/dali_operators.dir/expressions/./dali_operators_generated_expression_impl_factory_gpu.cu.o

make[2]: *** [dali/operators/CMakeFiles/dali_operators.dir/expressions/dali_operators_generated_expression_impl_factory_gpu.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 43%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/resampling_test/resampling_impl_cpu_test.cc.o
[ 43%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/resampling_test/separable_cpu_test.cc.o
[ 44%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/resampling_test/separable_impl_test.cc.o
[ 44%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/warp_test/warp_cpu_test.cc.o
[ 44%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/warp_test/warp_transform_test.cc.o
[ 44%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/alloc_test.cc.o
[ 45%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/any_test.cc.o
[ 45%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/block_setup_test.cc.o
[ 45%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/kernel_poc_test.cc.o
[ 45%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/kernel_test.cc.o
[ 46%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/manager_test.cc.o
[ 46%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/scatter_gather_test.cc.o
[ 46%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/scratch_copy_test.cc.o
[ 46%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/scratch_test.cc.o
[ 47%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/static_switch_test.cc.o
[ 47%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/test_data_test.cc.o
[ 47%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/test_utils_test.cc.o
[ 47%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/tuple_test.cc.o
[ 48%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/test/util_test.cc.o
[ 48%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/dali_kernel_test.cc.o
[ 48%] Building CXX object dali/kernels/CMakeFiles/dali_kernel_test.bin.dir/__/test/dali_test_config.cc.o
[ 48%] Linking CXX executable ../python/nvidia/dali/test/dali_kernel_test.bin
[ 48%] Built target dali_kernel_test.bin
make[1]: *** [dali/operators/CMakeFiles/dali_operators.dir/all] Error 2
make: *** [all] Error 2

cinjon commented 5 years ago

This was via the Docker approach described here: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/compilation.html. I ran it with DALI_BUILD_FLAVOR=nightly-frames PYVER=3.6 CUDA_VERSION=10 ./build.sh

JanuszL commented 5 years ago

This is strange. Can you try to build clean source code from master?

cinjon commented 5 years ago

I stashed the changes that @a-sansanwal suggested and reissued the build command, but ran into another error.

[ 45%] Building NVCC (Device) object dali/operators/CMakeFiles/dali_operators.dir/optical_flow/turing_of/dali_operators_generated_optical_flow_turing.cu.o
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
CMake Error at dali_operators_generated_expression_impl_factory_gpu.cu.o.Release.cmake:279 (message):
  Error generating file
  /opt/dali/build-docker-Release-36-10_x86_64/dali/operators/CMakeFiles/dali_operators.dir/expressions/./dali_operators_generated_expression_impl_factory_gpu.cu.o

make[2]: *** [dali/operators/CMakeFiles/dali_operators.dir/expressions/dali_operators_generated_expression_impl_factory_gpu.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [dali/operators/CMakeFiles/dali_operators.dir/all] Error 2
make: *** [all] Error 2

cinjon commented 5 years ago

The command I used was DALI_BUILD_FLAVOR=nightly-check PYVER=3.6 CUDA_VERSION=10 ./build.sh.