NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Using the External Source operator for video sequences #3433

Closed: anibali closed this issue 3 years ago

anibali commented 3 years ago

I have data loading requirements that do not fit into the "an example is a file on disk" type structure that DALI seems to assume natively. Essentially, I have image files (JPEGs) and crops within them that define "clips" (sequences of frames) cropped around particular people of interest (there can be multiple crops in the same images, leading to multiple examples). In case you were curious, the particular use case is multi-person tracking.

My first attempt at using DALI was to define my own ExternalInputIterator like the one shown in the tutorials (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/external_input.html), using the "batch dimension" as time (so each "batch" is actually a single clip). However, this causes issues when I try to use DALI's random augmentation capabilities (each frame in the clip is transformed separately). It also means that I can't easily batch multiple clips, since I am already (ab)using batching for another purpose. Here's a simplified version of what I currently have:

import numpy as np


class ExternalInputIterator:
    def __init__(self, data_index: DataIndex, sampler: Sampler):
        # `data_index` has metadata about each example in the dataset.
        self.data_index = data_index
        # `sampler` allows for custom shuffling of data.
        self.sampler = sampler

    def __iter__(self):
        self.isampler = iter(self.sampler)
        return self

    def __next__(self):
        example_info = self.data_index[next(self.isampler)]
        image_data = []
        mats = []
        for file_name in example_info['image_path']:
            # Read the encoded JPEG bytes; decoding happens later, inside the pipeline.
            with open(file_name, 'rb') as f:
                image_data.append(np.frombuffer(f.read(), dtype=np.uint8))
            mats.append(get_crop_matrix(example_info))
        # Return our "batch", which actually represents a single clip.
        return (image_data, mats)
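
To make this concrete, here is roughly how I wire the iterator into a pipeline (a simplified sketch; `data_index`, `sampler`, and `images_per_clip` stand in for values from my project):

import nvidia.dali as dali

@dali.pipeline_def
def clip_pipeline(source):
    # batch=True (the default): each item the iterator yields is treated as a whole
    # "batch", which here really means a whole clip.
    jpegs, mats = dali.fn.external_source(source=source, num_outputs=2)
    images = dali.fn.decoders.image(jpegs, device="mixed")
    return images, mats

pipe = clip_pipeline(ExternalInputIterator(data_index, sampler),
                     batch_size=images_per_clip, num_threads=2, device_id=0)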

Is there a way of using External Source (external_source) for the case where each individual example is a sequence of images? I've looked through the documentation and the issues here, but couldn't find anything. I also thought long and hard about how to solve this based on what I've read, but the input-side of Dali feels rather inflexible and is steeped in C++ so I can't easily take fn.readers.sequence and modify it (for example). The best solution I can think of at the moment is decoding the JPEGs on the CPU as part of the ExternalInputIterator, but I'd prefer to do this on the GPU as part of the pipeline. (EDIT: this doesn't actually work anyway since warp_affine doesn't support applying per-frame warps to video sequences, see https://github.com/NVIDIA/DALI/issues/2832)

EDIT: I've also tried padding the encoded JPEG data so that I can create one big numpy array from all of the frames in the clip, but it seems that dali.fn.decoders.image does not recognise the multiple images and only decodes the first.

JanuszL commented 3 years ago

Hi @anibali,

DALI treats sequences as video frames, which means that they should have a uniform size (which may not be applicable in your case), and all transformations are applied uniformly across the frames in the sequence inside one sample.

However, this causes issues when I try to use DALI's random augmentation capabilities (each frame in the clip is transformed separately).

What you can do (depending on what random augmentation you use) is to try out the permute_batch operator. If you use a random number generator to drive other operators, you can use permute_batch to duplicate one value across all samples (so the same randomness is applied to every sample).
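
For instance, a minimal sketch of the idea (this snippet lives inside a pipeline definition; `fn` is `nvidia.dali.fn`, and it assumes the batch size is known statically as `batch_size`):

# Generate one random angle per sample, then replace every sample's value with sample 0's,
# so a downstream operator transforms the whole batch (i.e. the whole clip) identically.
angle = fn.random.uniform(range=(-10.0, 10.0))
angle = fn.permute_batch(angle, indices=[0] * batch_size)
images = fn.rotate(images, angle=angle)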

If you can provide a self-contained example we can run on our side we can provide more suggestions.

anibali commented 3 years ago

Thanks for the prompt response! I really appreciate how communicative you are in this project (it definitely made browsing past issues much more fruitful).

DALI treats sequences as video frames, which means that they should have a uniform size (which may not be applicable in your case)

This does indeed apply for my case, and seems like a reasonable assumption to me.

and all transformations are applied uniformly across the frames in the sequence inside one sample.

This is a problem for one type of augmentation that I have in mind (simulated camera movement). It would also definitely be a problem for cases where the crop "tracks" the subject (not the case for my current project). I'm not sure if it's an unavoidable technical limitation, but disallowing per-frame transformation is a very big restriction that would have caused major issues for me in past projects.

What you can do (depending on what random augmentation you use) is to try out the permute_batch operator. If you use a random number generator to drive other operators, you can use permute_batch to duplicate one value across all samples (so the same randomness is applied to every sample).

permute_batch sounds like it might work for my current setup (sequence as a batch of images); I'll give it a go.

JanuszL commented 3 years ago

Hi @anibali,

I'm not sure if it's an unavoidable technical limitation, but disallowing per-frame transformation is a very big restriction that would have caused major issues for me in past projects.

It is rather a strong limitation. As I explained, we assume that a sequence is a sample that should be transformed uniformly across its frames. If you want a different transformation per frame, then I would treat the sequence as a batch of separate frames.

permute_batch sounds like it might work for my current setup (sequence as a batch of images); I'll give it a go.

I'm looking forward to hearing more about your results.

anibali commented 3 years ago

I can confirm that permute_batch worked for replicating randomly generated numbers so that they are shared across images belonging to the same clip. To make things easier I wrote a little helper:

import numpy as np
import nvidia.dali as dali


class PerClipRng:
    """An NVIDIA DALI helper for generating per-clip random numbers.

    Assuming that "batches" in the pipeline have the following layout
        [A1, A2, ..., An, B1, B2, ... Bn, C1, ...]
    it is guaranteed that the random numbers generated will be the same for each image in a clip
    (e.g. A1, A2, ..., An will all have the same value).
    """
    def __init__(self, clips_per_batch, images_per_clip):
        self.clips_per_batch = clips_per_batch
        self.images_per_clip = images_per_clip

    def _repeat_per_clip(self, batch):
        indices = list(np.repeat(np.arange(self.clips_per_batch), self.images_per_clip))
        batch_replicated = dali.fn.permute_batch(batch, indices=indices)
        return batch_replicated

    def coin_flip(self, probability=None):
        batch = dali.fn.random.coin_flip(probability=probability)
        return self._repeat_per_clip(batch)

    def normal(self, mean=None, stddev=None):
        batch = dali.fn.random.normal(mean=mean, stddev=stddev)
        return self._repeat_per_clip(batch)

    def uniform(self, range=None):
        batch = dali.fn.random.uniform(range=range)
        return self._repeat_per_clip(batch)
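
For illustration, usage inside a pipeline definition looks something like this (`my_source` and the flip augmentation are placeholders, not my actual pipeline):

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

clips_per_batch = 4
images_per_clip = 8
rng = PerClipRng(clips_per_batch, images_per_clip)

@pipeline_def
def clip_pipeline(source):
    jpegs = fn.external_source(source=source)
    images = fn.decoders.image(jpegs, device="mixed")
    # Every frame belonging to the same clip receives the same flip decision.
    do_flip = rng.coin_flip(probability=0.5)
    return fn.flip(images, horizontal=do_flip)

pipe = clip_pipeline(my_source, batch_size=clips_per_batch * images_per_clip,
                     num_threads=2, device_id=0)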

I still think that it would be nice if there were a way to have the external source produce video sequences, as opposed to taking the batch-of-images approach, but I'm going to mark this issue as resolved. Thanks for your help.

JanuszL commented 3 years ago

Hi @anibali,

If your frames have a uniform size, you can do something like this as well:

import numpy as np
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
import os

batch_size = 10
sequence_length = 4

test_data_root = os.environ['DALI_EXTRA_PATH']
jpeg_file = os.path.join(test_data_root, 'db', 'single', 'jpeg', '510', 'ship-1083562_640.jpg')

def get_data(sample_info):
    # Just an example which repeats the same frame, but you can put different frames there.
    # If the images are encoded you need to zero-pad them all to the same length; an encoded
    # JPEG has its size in the header, so trailing zeros are harmless.
    out = [np.fromfile(jpeg_file, dtype=np.uint8) for _ in range(sequence_length)]
    # add a label
    out.append(np.array([1, 2, 3]))
    return out

@pipeline_def
def simple_pipeline():
    *jpegs, label = fn.external_source(source=get_data, num_outputs=sequence_length+1, parallel=True, batch=False)
    images = fn.decoders.image(jpegs, device="mixed", hw_decoder_load=1)
    sequence = fn.stack(*images)
    sequence = fn.reshape(sequence, layout="DHWC")
    return sequence, label

pipe = simple_pipeline(batch_size=batch_size, num_threads=4, prefetch_queue_depth=2, device_id=0)
pipe.build()
out = pipe.run()
# The decoded frames live on the GPU, so copy the output to the host before inspecting it.
print(np.array(out[0].as_cpu()[0]).shape)
print(np.array(out[1][0]))
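
If this runs as intended, the first print should show a shape of the form (sequence_length, H, W, 3), i.e. the individually decoded frames stacked into one sequence sample, and the second should print the label [1 2 3].
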
anibali commented 3 years ago

Ah, that's a really good example. I didn't realise that you could pass a list as input to fn.decoders.image!

JanuszL commented 3 years ago

Please keep in mind that passing a list to an operator creates as many instances of the operator as there are elements in the list. In the case of the mixed decoder, each instance will allocate its own GPU memory, so with a bigger sequence length you can simply run out of it.