NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0
5.14k stars · 620 forks

Is VideoReader compatible with Resize? #1247

Closed · predrag12 closed this issue 4 years ago

predrag12 commented 5 years ago

Hello,

Trying to use Resize after VideoReader on a GPU for inference:

  self.input = ops.VideoReader(device="gpu", filenames="sintel_trailer-720p.mp4", ...)
  self.resize = ops.Resize(device="gpu", resize_x=..., resize_y=...)
  ...
  output = self.resize(self.input())

Results in an error:

Assert on "input.GetLayout() == DALI_NHWC" failed: Resize expects interleaved channel layout (NHWC)

VideoReader has a scale param, but it doesn't offer all the config options that Resize does, and none of the samples, like video_label_example or the optical flow example, show resizing.

Alternatively, is DALI compatible with NVDECODER GPU->GPU in any form? If yes, is there an example?

Thank you.

awolant commented 5 years ago

Hi, thanks for the question. For now, our Resize op does not support video data.

Could you explain what you mean by:

Alternatively is DALI compatible with NVDECODER GPU->GPU in any form, if yes, is there an example?

predrag12 commented 5 years ago

Hi,

I was hoping to avoid GPU->CPU->GPU transitions between the video decoder and resizing and other transform operations. If VideoReader is not compatible with DALI Resize, is there any intermediate repackaging GPU operator that can be used between the two, or is going through CPU memory the only way? If not, is there a sample (C/C++, Python) where the underlying NVDEC decoder can directly pass GPU frames to Resize and other DALI operators?

Thank you.

awolant commented 5 years ago

Ok, so to copy DALI outputs directly to the GPU you can use copy_to_external. I wrote a small Python example of how to use it with PyCUDA here. This should allow you to use some custom Python code to resize the frames outside the DALI pipeline. Another option would be to write your own custom resize op for videos using our plugin system. This is definitely more work, but it would allow you to resize the videos inside the pipeline, if that is what you want.

predrag12 commented 5 years ago

The intent is to keep all the video-related operations on the GPU; resizing outside the DALI pipeline in Python on the CPU would likely be too slow. Maybe an inverse (copy_from_external), where an external source can share its GPU data directly with the DALI pipeline - in this case video file or stream -> resize and the rest of the pipeline. As far as the custom operator options go:

@awolant could you comment on the options above? By the way, it seems a little odd that Resize would not be made compatible with VideoReader, since resizing is a very common operation, and setting sequence_length=1 (F=1) doesn't help.

Thank you.

dave-andersen commented 5 years ago

DALI folks, for those of us who don't care about sequences, is there a possibility of a Reshape that can be used to go from NFHWC to NHWC, and thus become compatible with the non-sequence ops?

awolant commented 5 years ago

My intention with the above copy_to_external was to resize outside of the DALI pipeline in Python, but on the GPU, using some kind of library or custom native code.

About your proposed solutions:

As you pointed out, extending the existing Resize op is the best option by far.

In general, DALI video support is in its rather early phase and we are working on improving it. We will take requests like this into consideration.

dave-andersen commented 5 years ago

Thanks, Albert. My comment on Reshape to drop F is that I'm happy with just having a batch dimension, as I'm processing each frame independently -- basically just using DALI as a very efficient loader to get data into pytorch/tensorflow. And it's not necessarily the case that the F dimension would be gone forever -- one could reshape from NFHWC to NHWC, do per-frame transforms, and then reshape back to NFHWC. There's only a small subset of processing ops that require sequences (optical flow, etc.).
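The round trip described above (collapse NFHWC to NHWC, apply per-frame ops, restore) can be sketched with NumPy; the shapes are illustrative and this is a conceptual sketch, not DALI's actual API:

```python
import numpy as np

# Illustrative batch of N videos, F frames each, in NFHWC layout.
N, F, H, W, C = 2, 8, 4, 6, 3
video_batch = np.arange(N * F * H * W * C, dtype=np.uint8).reshape(N, F, H, W, C)

# Collapse N and F into a single batch dimension for per-frame ops.
frames = video_batch.reshape(N * F, H, W, C)

# ... per-frame transforms (resize, normalize, etc.) would run here ...

# Restore the original NFHWC shape afterwards.
restored = frames.reshape(N, F, H, W, C)
assert np.array_equal(restored, video_batch)
```

For C-contiguous data both reshapes are views, so the round trip costs nothing but descriptor changes.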

predrag12 commented 5 years ago

Hi @awolant,

The motive here is to use DALI within the GPU bound video inference pipeline other than optic flow, whether it is from video file or streaming. For the video file it would be natural to use VideoReader and Resize and treat consecutive frames as batch of images if sequence is set to 1. Similarly for video streaming as batch of images, would require to replace the source with something else other than VideoReader like (nv)decoder, but still be able to pass GPU memory from source to rest of DALI pipeline.

For VideoReader, if F=1, wouldn't N>1 contain consecutive frames and formally be compatible with Resize, which expects NHWC although it is receiving N1HWC? Is the data packed the same, and is just the descriptor confusing Resize?

For ExternalSource, it is listed under https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/supported_ops.html#externalsource as both CPU and GPU; are there any GPU samples?

So the question is: going any of the routes above - modifying or creating a new Resize, or creating a new ExternalSource - are there any more advanced custom operator C/C++ samples that exemplify data operations between input and output, rather than just a copy?

Thank you.

aalugore commented 5 years ago

Hi @awolant,

I was wondering why the decision was made to add the 'F' dimension for video files. Having this restriction makes the filter difficult to use in the general case. In theory - and in my own naïveté - each frame of a video could be treated as a separate image, making the 'F' dimension feel a bit redundant. If there is a specific reason for it, could you point us to the DALI source code where that dimension is being used? We are toying with the option of altering the source code of the Resize filter to handle the transformation between NFHWC <-> NHWC, so maybe you could help point us in the right direction.

Additionally, I noticed you mentioned ExternalSource does not accept input on the GPU, despite the documentation @predrag12 cited above saying it can be run on the GPU. Is this an error in the documentation? Or are "accepting input on the GPU" and "being able to be scheduled on the GPU", as the documentation states, two different things?

Lastly, I was hoping you could point us to more interesting examples of ExternalSource and/or Resize being used with GPUs, if those exist - or perhaps just the area of the DALI source code we would need to look into if we were indeed going to extend the video reading functionality.

Thanks in advance!

JanuszL commented 5 years ago

Hi, ExternalSource accepts input only on the CPU (via a numpy array), but the output can be on the GPU or the CPU - that is what the documentation says (maybe not 100% self-explanatory). One of the examples is in https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/util/external_source_test.cc. However, it should not be difficult to extend it to accept GPU inputs using a cupy array, cuda_array, or the DLPack interface. Currently, there aren't many operators that support sequences - but you can check ElementExtract. Regarding sequences - yes, from the processing point of view there are just F * batch_size images. On the other hand, some transformations could benefit from grouping F consecutive images into one bag of processing kernels - e.g., some coefficients (during resize) could be reused for the whole sequence, saving storage and compute time.

predrag12 commented 5 years ago

Hi @JanuszL,

Thanks for the link. Are there any tutorials on how to make either a GPU(-output) ExternalSource or a source operator in C/C++? https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/util/external_source_test.cc could use some comments in that regard. The custom operator sample https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/extend/create_a_custom_operator.html is also rather oversimplified for learning any real data transformations between input and output.

The problem with sequences is that very few operators support them - OpticalFlow is the only one I am aware of - and the rest expect batches, so compatibility between operators breaks, as with VideoReader->Resize when the video is treated as a continuous stream of frames. Does applying ElementExtract transform NFHWC -> NHWC?

Regarding Resize, there is the (input) data itself and the data descriptor (shape). Would setting F=1 on VideoReader produce N1HWC with the same packed data as NHWC? If yes, then only the line of Resize code that asserts the shape would need to be changed to accept N1HWC. If not, then N1HWC needs to be repackaged to NHWC; is there any information on how NFHWC is packed in GPU memory?

Thank you.

JanuszL commented 5 years ago

Hi, to make the output of ExternalSource available on the GPU, just write something like this:

  self.input = ops.ExternalSource(device="gpu")

Then the rest is the same as in https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/external_input.html. Regarding breaking the compatibility - yes and no. A sequence of frames is by definition not compatible with single images; even if you treat them as a flat sequence it won't work, as it is F * batch_size images instead of batch_size. In the case of ElementExtract, it can return more than one frame per element and it returns sequences - so it won't work either.

Regarding the Resize there is the (input) data itself and data descriptor (shape). Would setting the F=1 on VideoReader produce N1HWC as the same packed data as NHWC? If yes then only the line of Resize code that asserts the shape would need to be changed to accept N1HWC. If not then N1HWC needs to be repackaged to NHWC, is there some information about how NFHWC is being packed in GPU memory?

Theoretically, it could work - frames should be placed contiguously in memory - but it is rather a hack to make it work with a frame count of 1. Readjusting Resize to work with sequences should not be that difficult, I guess. @mzient could provide guidance if you want to jump into the code.
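The "frames are contiguous" point can be checked in isolation with NumPy: for a C-contiguous N1HWC array, dropping the singleton F axis is purely a descriptor change and the underlying bytes are untouched. This is a sketch under the assumption that DALI stores dense tensors C-contiguously; it does not touch DALI itself:

```python
import numpy as np

N, H, W, C = 4, 8, 8, 3
seq = np.zeros((N, 1, H, W, C), dtype=np.uint8)  # N1HWC, i.e. F == 1

# Dropping the size-1 F axis yields an NHWC view over the same memory.
flat = np.squeeze(seq, axis=1)

assert flat.base is seq            # a view, not a copy
assert flat.shape == (N, H, W, C)  # only the shape metadata changed
assert seq.nbytes == flat.nbytes   # same underlying bytes
```

So for F=1 the obstacle really is the layout descriptor check, not the data packing.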

predrag12 commented 5 years ago

Hi @JanuszL,

The result of ElementExtract is still not clear in terms of whether the current Resize can absorb it. A frame is technically a sequence of one, and multiplied by the batch that is NHWC. So when the VideoReader sequence length is set to 1, or the ElementExtract element count is set to 1, what do they produce in terms of the N F H W C descriptor, and what do they produce in terms of GPU memory tensor packing? Is the descriptor the problem for Resize, or the data memory packing? And is there a good C/C++ sample that exemplifies DALI GPU data access or transformations?

Thank you.

JanuszL commented 5 years ago

@predrag12 - it is a problem that Resize doesn't support sequences for now, even if there is only one frame inside - @mzient, am I correct? Regarding C++, you can look into our tests, but we don't officially support the native C++ API, which is still under heavy rework. So we see no point in providing and maintaining C++ examples that become obsolete from release to release. What we want to do is finish the rework that is still in progress, so we can then think about a C++ API we want to commit to supporting in the long term.

syb0rg commented 4 years ago

@JanuszL You mentioned above that "Readjusting resize to work with sequences should not be that difficult". Has any further work been done on this recently? I've been attempting to create a custom operator for this and am currently stuck re-implementing ResizeBase::RunGPU(). I am just wondering if I should continue independently or work with you to implement it officially?

JanuszL commented 4 years ago

@syb0rg - we haven't made any progress, as we are pursuing other goals now. If you have any specific questions about how to proceed with your implementation, feel free to ask.

syb0rg commented 4 years ago

@JanuszL Sure, my C++ days are a bit rusty and I'm not super familiar with the library's structure yet so that will be super helpful!

Right now I can get my custom operator to hit the ResizeBase::RunGPU method, then it runs into some issues with the number of dimensions. I can't merely change all the 3s to 4s in the file as I'll get compilation errors. I'm thinking I change the in_view to handle 4, allow SubdivideInput() to run to get the minibatches, then I need to split the video up by their frames to reduce the dimension cardinality to 3 and process the resize like normal. Afterwards I would regroup the frames into the original video and return the output. Thoughts? Recommendations? I can attach what work I have right now too if that would help you understand.

mzient commented 4 years ago

My idea would be to change the meaning of minibatches to allow them to represent a bunch of frames (possibly from different videos in the batch) - TensorListView can contain images from multiple samples (tensors). As far as the CPU variant goes, you could just resize the kernel manager or interleave setup/run. We'll probably do something similar on our own, but the timing depends on many factors.

predrag12 commented 4 years ago

Are there any tangible changes planned to consolidate the unfortunate discrepancy/inconsistency between the treatment of a series of image files vs. a series of video frames (technically still images), either in the official VideoReader or the Resize operator? If affirmative, can you please comment on the timeline? If negative - since this was raised 7 months ago and it was mentioned that "Readjusting resize to work with sequences should not be that difficult" - could you please provide a C/C++ snippet that handles F in NFHWC and directly illustrates your point? Referring to the GPU end-to-end inference case.

JanuszL commented 4 years ago

Are there any tangible changes planned to consolidate the unfortunate discrepancy/inconsistency between treatment of series of image files vs. series of video frames (technically still images) either in official VideoReader or Resize operator?

I don't think there is any discrepancy/inconsistency. For a sequence of frames, you need to resize the set of consecutive frames within a sample with the same scale, but the scale may vary from sample to sample in the batch. In the case of single images, each image may have a different scale. It looks similar, but the details differ. We plan to implement it in the near future, but now we are focusing on the audio operators; again, I cannot commit to any particular date, as our priorities depend on many factors. As said, if you are in a rush you can try to do it on your own, and we would be more than happy to accept and review any PR that implements this functionality.
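The scale-bookkeeping distinction drawn here can be illustrated in plain Python: for sequences, one scale factor per sample is shared by all F of its frames, whereas for plain images each sample gets its own. The numbers below are made up for illustration:

```python
# One resize scale per sequence in the batch, shared by all F frames of
# that sequence (vs. per-image scales for a batch of plain images).
batch_size, F = 3, 4
sample_scales = [0.5, 1.0, 2.0]  # one per sequence, len == batch_size

# Expanding to per-frame scales shows the flat frame count is F * batch_size.
per_frame = [s for s in sample_scales for _ in range(F)]

assert len(per_frame) == batch_size * F
assert per_frame[:F] == [0.5] * F  # all frames of sample 0 share its scale
```

This is why a sequence-aware Resize can reuse per-sample coefficients across frames, while a naive flatten-to-frames approach recomputes (or at least re-stores) them F times.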

predrag12 commented 4 years ago

It is inconsistent in the sense that, whether the image in memory comes from a file or from video, the resize filter - or in your case operator - should treat the width and height dimensions in the same way, and treat other dimensions like N and F as just a matter of repetition, since there is no dependence between consecutive frames, particularly for video where the resolution is basically fixed. One would hardly call a 7-month wait for this "rushing" it. Since you mentioned above that this is not a difficult adjustment, one would expect that it can or should be done organically. If you need external help, that would be predicated on having a good tutorial on how to write more complex operators beyond the simplest one mentioned in the documentation - or, as repeatedly requested earlier, a short C/C++ snippet that exemplifies how to approach unpacking and packing dimensions in the resize source code; I am sure people would be more than happy to at least start with that.

JanuszL commented 4 years ago

We really appreciate that the community around DALI is getting more and more engaged. We will bump up the priority of this request and we will keep you posted.

predrag12 commented 4 years ago

There are most likely two avenues here, since N and F are temporal dimensions and HWC are spatial: one, Resize handling F; and two, VideoReader handling F=0 or -1 by eliminating the F dimension altogether, so that Resize could handle the output without modification. I would personally prefer the VideoReader change; not sure if you would want to make it official eventually or not, but it would certainly work, even unofficial, for our purposes. If you could point to the lines of VideoReader code that would need modification, and how to handle multiple dimensions both in terms of manipulating the data blob and emitting the data format, that may be very helpful. Thanks.

JanuszL commented 4 years ago

One Resize handling F

So you want to interpolate frames? I don't know if that is going to work. In the end, you would end up with ghosting and some strange temporal artifacts. If you want to interpolate frames, you need more than a simple resize algorithm.

second VideoReader handling F 0 or -1 by eliminating F dimension altogether and then resize could handle it without modification

You want to return a single frame instead of a sequence and work with that? If so, you need to look into https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/loader/video_loader.cc#L710 and https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/video_reader_op.h#L86 for a start.

predrag12 commented 4 years ago

I didn't mean to interpolate frames, but to have only one temporal dimension emitted by VideoReader, which Resize (as is) can understand as a batch of frames. The word "sequence" is overloaded here: colloquially it means a series of samples, but in the documentation it means F, not N. Resize can currently handle one sequencing/temporal dimension, N (a batch of N samples/frames); it expects F to be implicitly 1, or rather nonexistent as a dimension, in both the input data blob and the data descriptor. VideoReader, on the other hand, has a batch of sequences, N x F; so to make its output compatible with Resize, I mentioned passing F=0 or -1 as a VideoReader param as a means for it to know to emit not NFHWC or N1HWC but just NHWC. N and F would be collapsed into one, and Resize would read it as N - something like numpy squeeze - which makes sense in batch inference (with output size N). I am not sure, though, how the frame data is packed: does it equate to a contiguous blob over all dimensions, so that just the descriptor needs to change, or does the data need to be rearranged too? For example, changing https://github.com/NVIDIA/DALI/blob/b152caa7f41e1434009af976df68e482208340a2/dali/operators/reader/video_reader_op.h#L104 to "NHWC"

JanuszL commented 4 years ago

Squeezing the dimension won't work, as Resize expects a batch_size number of samples. If you squeeze, you will get F * batch_size. With the current architecture, you need to adjust Resize.
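The point above can be made concrete with a NumPy sketch: flattening NFHWC does yield a valid dense NHWC tensor, but its leading dimension becomes F * batch_size rather than batch_size, which is what trips up any downstream op that assumes a fixed batch size (illustrative shapes, not DALI code):

```python
import numpy as np

batch_size, F, H, W, C = 4, 16, 32, 32, 3
sequences = np.zeros((batch_size, F, H, W, C), dtype=np.uint8)  # NFHWC

# Collapsing N and F produces perfectly valid NHWC data...
frames = sequences.reshape(-1, H, W, C)

# ...but the "batch" is now F * batch_size samples, not batch_size.
assert frames.shape[0] == F * batch_size  # 64, not 4
```

So the squeeze is harmless for the pixel data itself; the mismatch is in the batch-size contract between operators.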

JanuszL commented 4 years ago

You can always take a look at https://github.com/NVIDIA/DALI/issues/1740#issuecomment-590891810 - someone managed to get a good result.

predrag12 commented 4 years ago

I don't think FxN would be a problem if one cares only about a batch of images to feed to inference (not training) and doesn't care about the sequence, since there is no correlation between image frames; for example, a batch of 10 frames would be 10HWC (or rather 10CHW before being fed to inference). ExternalSource is able to produce NHWC that is compatible with Resize - why would VideoReader not be able to produce the same, with mods? Similar questions related to video sampling, both spatial and temporal, were brought up in #1183, #1356, #1405, #1478, #1770, #1825, #2069.

So the question is if one wants to send 10HWC from VideoReader that Resize would receive, is it doable to

JanuszL commented 4 years ago

I don't think FxN would be a problem if one cares only about a batch of images to feed to inference (not training) and doesn't care about the sequence, since there is no correlation between image frames; for example, a batch of 10 frames would be 10HWC (or rather 10CHW before being fed to inference). ExternalSource is able to produce NHWC that is compatible with Resize - why would VideoReader not be able to produce the same, with mods? Similar questions related to video sampling, both spatial and temporal, were brought up in #1183, #1356, #1405, #1478, #1770

If I understand correctly, you want the VideoReader to produce a batch of frames instead of sequences. Is that correct? If so, then the references below are still valid:

predrag12 commented 4 years ago

Correct - a batch of frames that Resize and the inference engine understand. Looking at the code links, there is not much in terms of code comments; do you have any guidance as to which lines of code to change, or any comments on the approach suggested above?

JanuszL commented 4 years ago

do you have any guidance as to which lines of code to change, any comments on the approach suggested above (below)?

Nothing in particular - only the starting references I have already shared. We would need to dig into the code ourselves to figure out how to do it properly.

bartwojcik commented 4 years ago

When is this expected to land? I must say that the lack of image-operation support for videos (especially resize) is rather disappointing. As others have pointed out, this is conceptually simple for per-frame operations and should be possible to implement with a simple reshape-like operation, shouldn't it?

JanuszL commented 4 years ago

When is this expected to land?

A couple of weeks. Still, we encourage the community to take a stab at it and help with the development.

aalugore commented 4 years ago

Hello, I was reading through this again and see that it hasn't been updated in a while. I understand time frames may have changed with the state of the world, but I was wondering if any internal progress has been made, or if we have a new outlook on when this could be addressed?

JanuszL commented 4 years ago

Hi, we are still pursuing other requests and priorities. We hope to get to this soon.

predrag12 commented 4 years ago

Hi,

Could you please be a little more specific about the "soon" ETA? Just for reference, this issue is entering its ninth month now; a month and a half ago it was indicated to be a "couple of weeks" away, and it was said that this "should not be that difficult". There are a number of participants on this thread, and it is linked by a number of issues with the same or similar request, i.e. compatibility between the video and resize operators. Asking the community to contribute to this would be predicated on comprehensive documentation, samples, and recommendations closely related to how the data blob is packed, the descriptor, etc., none of which are available; and if it is simple but would take more time to document, maybe it could be done internally. Can you please comment on, or recommend, which VideoReader code lines would need to change, and how, in order for VideoReader to be compatible with Resize (as is) and inference downstream?

Thanks.

JanuszL commented 4 years ago

Hi, DALI is an open-source project; however, we have many drivers and stakeholders, not necessarily visible on GitHub. We are doing our best to meet the expectations of as many customers and users as possible, but we cannot make everyone happy. Right now we are fully focused on providing comprehensive audio support. I have already seen that someone managed, with a little help from our side, to implement this on his own: https://github.com/NVIDIA/DALI/issues/1740.

predrag12 commented 4 years ago

So there is no commitment to any timeline at all, or to providing sufficient documentation, samples, or recommendations closely related to the subject of this issue? As a summary/reminder, the ask here is not a custom resize operator, but how to fix VideoReader to be compatible with the existing Resize and the rest of the inference pipeline.

awolant commented 4 years ago

Hi @predrag12, recently we introduced the VideoReaderResize op, which can resize videos. It supports all Resize parameters. It is available in our nightly build.

JanuszL commented 4 years ago

DALI 0.25 has been released. The requested feature is implemented there.

predrag12 commented 4 years ago

Unfortunately, this does not meet the requirement that was (repeatedly) requested in this issue over the entire year. The ask was to optionally, via params, eliminate the video sequence dimension (F) and leave the batch (N) - not to combine two operators and still have a mandatory sequence F on the output of VideoReader, which breaks inference. The ask was also to have some documentation or comments in the code so that VideoReader could be modified more easily.

JanuszL commented 4 years ago

the requirement that was (repeatedly) requested in the issue during entire year, the ask was to optionally via params eliminate video sequence (F) and leave batch (N

I think it was stated that Reshape (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.Reshape) with a sequence length of 1, or ElementExtract (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.ops.ElementExtract), should do what you want.