NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Building (video) pipeline slow with high number of samples #4607

Open elmuz opened 1 year ago

elmuz commented 1 year ago

Hello. I noticed that the time required to build the pipeline grows linearly with the number of samples in the video list. I am using more or less this code:

train_loader = DALIGenericIterator(
    pipelines=[
        MyPipeline(
            sample_list_path=video_file_list,  # this is the txt file where each line is 'path label start end'
            shuffle=True,
            batch_size=self.batch_size,
            num_threads=2,
        )
    ],
    ...
)
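
For context, the `file_list` txt format referenced above (one `path label start end` entry per line) can be generated with a small script. A minimal sketch, with made-up paths and labels:

```python
# Sketch: generate a DALI-style file_list txt where each line is
# 'path label start_frame end_frame'. Paths and labels here are
# hypothetical placeholders.
samples = [
    ("/data/videos/clip_000.mp4", 0, 0, 150),
    ("/data/videos/clip_001.mp4", 1, 10, 200),
]

with open("video_file_list.txt", "w") as f:
    for path, label, start, end in samples:
        f.write(f"{path} {label} {start} {end}\n")

# Read it back to verify the format.
with open("video_file_list.txt") as f:
    rows = [line.split() for line in f]
print(rows[0])  # ['/data/videos/clip_000.mp4', '0', '0', '150']
```

With `file_list_frame_num=True` (as in the pipeline below), the start/end columns are interpreted as frame numbers rather than timestamps.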

I noticed that executing the above snippet can take a long time when the number of videos is large. I am referring only to the building time. For example:

- 1000 samples (rows) in the txt file -> 12.81 sec
- 2000 samples (rows) in the txt file -> 24.52 sec

My dataset is much bigger than that (100x), so this linearly increasing setup time is not a viable solution.

Are these numbers expected in your experience? Maybe I am doing something wrong while configuring the operators... I am using DALI 1.20.0 from official Nvidia 22.12 container.

awolant commented 1 year ago

Hello, thanks for the question. It looks like MyPipeline is a function that constructs a DALI pipeline for you. Is that right? Could you share its code? It would be easier to pinpoint the problem if we knew more about your use case, especially the parameters you pass to the video reader op. Thanks!

elmuz commented 1 year ago

Sure!

@pipeline_def(num_threads=2, device_id=0)
def SpeechPipeline(sample_list_path: Union[str, Path], shuffle: bool = False):
    frames, video_id, frame_num = fn.readers.video(
        name="speech_reader",
        device="gpu",
        file_list=f"{sample_list_path}",
        sequence_length=5,
        step=1,
        random_shuffle=shuffle,
        initial_fill=128,
        file_list_frame_num=True,
        enable_frame_num=True,
        file_list_include_preceding_frame=True,
    )

    return frames, video_id, frame_num

awolant commented 1 year ago

Thanks.

This looks fine. Unfortunately, upon pipeline creation the DALI video reader needs to create an av context for each file to discover the number of frames and other metadata, if present. I think improving this behavior for large numbers of files is a legitimate enhancement request. We are working on some improvements to the video reading capabilities, and we should definitely look into this. Another possible solution would be the ability to provide the necessary metadata to the reader directly, so it can be computed once and reused without being parsed again during every pipeline build.

One workaround that comes to mind for now is to glue your videos together offline (using FFmpeg or similar) into larger chunks and use the sequence_length and step arguments to extract the same samples. This might improve the situation somewhat, but I am not sure by how much without trying it.
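
A minimal sketch of the suggested concatenation step, using FFmpeg's concat demuxer. The file names are hypothetical, and this snippet only constructs the command line rather than running FFmpeg:

```python
# Sketch: build an FFmpeg concat-demuxer command to glue clips into one
# larger chunk. File names are hypothetical. Stream copy (-c copy) avoids
# re-encoding, but requires all inputs to share codec/resolution.
clips = ["clip_000.mp4", "clip_001.mp4", "clip_002.mp4"]

# The concat demuxer reads its inputs from a list file with "file '...'" lines.
with open("concat_list.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

cmd = [
    "ffmpeg", "-f", "concat", "-safe", "0",
    "-i", "concat_list.txt", "-c", "copy", "chunk_000.mp4",
]
print(" ".join(cmd))
```

The resulting chunk can then be listed once in the DALI `file_list`, cutting the number of files whose metadata must be parsed at build time.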

Hope that helps.

olympiquemarcel commented 2 months ago

Is the pipeline building really independent of the number of workers used, with no parallelization to speed it up? I did some scaling tests building a video pipeline that reads 10,000 videos (~300 GB in total) on a system with 4 GPUs per node. These are the results:

| n GPUs | Pipeline Building Time | Time per Epoch |
|--------|------------------------|----------------|
| 4      | 639 s                  | 3900 s         |
| 8      | 644 s                  | 1920 s         |
| 16     | 643 s                  | 960 s          |
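
The table shows the epoch time scaling almost perfectly with GPU count while the build time stays flat, so the serial build becomes a growing fraction of the total. A quick back-of-the-envelope check using the reported values:

```python
# Rough check on the scaling numbers above: epoch time halves as GPUs
# double, while the pipeline build time stays roughly constant, so the
# build's share of (build + one epoch) grows with GPU count.
results = {4: (639, 3900), 8: (644, 1920), 16: (643, 960)}  # gpus: (build_s, epoch_s)

for n_gpus, (build_s, epoch_s) in results.items():
    total = build_s + epoch_s
    share = build_s / total
    print(f"{n_gpus:>2} GPUs: build is {share:.0%} of build + one epoch ({total} s)")
```

At 16 GPUs the fixed build cost is already about 40% of the first epoch's wall time, which is why parallelizing (or caching) the metadata scan matters at scale.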

The PyTorch profiler tells me most of the time is spent here:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1  609.193  609.193  609.193  609.193 {built-in method nvidia.dali.backend_impl.reader_meta}

JanuszL commented 2 months ago

Hi @olympiquemarcel,

Thank you for reaching out. Yes, currently the build process (file discovery and metadata reading) is single-threaded. The reader_meta method calls many native functions under the hood, so the Python-side profile doesn't provide the full picture.

olympiquemarcel commented 2 months ago

Hi @JanuszL, thanks for the quick answer. As far as I can see, the output of the reader_meta method is quite a simple dict for each of the workers, in the form of

{"Reader": {"epoch_size": 2358372, "epoch_size_padded": 2358376, "number_of_shards": 16, "shard_id": 0, "pad_last_batch": 1, "stick_to_shard": 1}}

Would it be enough to save this dict once it has been created and then load it again to avoid the long building time?

JanuszL commented 2 months ago

Hi @olympiquemarcel,

Indeed, the output is simple, but to generate it (i.e., to learn the number of samples in the dataset) DALI needs to open each file, create an FFmpeg libav context for it (which takes most of the time), and discover the number of frames in each video. Then, based on step, stride, and sequence_length, it can calculate the number of samples it can generate from each video. Long story short, calling reader_meta fully initializes the video loader if it hasn't been already. Consecutive calls to reader_meta should be much faster, as the reader is already initialized.
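
The per-video sample count described above can be sketched with simple arithmetic. This is a simplified model of the step/stride/sequence_length calculation, ignoring edge cases such as file_list_include_preceding_frame and padding, so treat it as an approximation rather than DALI's exact logic:

```python
# Simplified model: how many fixed-length sequences one video yields.
# 'step' is the distance between sequence start frames; 'stride' is the
# distance between frames inside a sequence. Real DALI handles more
# edge cases (padding, file_list windows, etc.).
def num_sequences(num_frames: int, sequence_length: int,
                  step: int = 1, stride: int = 1) -> int:
    span = (sequence_length - 1) * stride + 1  # frames covered by one sequence
    if num_frames < span:
        return 0
    return (num_frames - span) // step + 1

# E.g., with the reader settings from the pipeline above (sequence_length=5,
# step=1) a 100-frame video yields 96 possible start positions:
print(num_sequences(100, sequence_length=5, step=1))  # 96
```

Summing this over every file is exactly why the build must open each video: the epoch_size in reader_meta cannot be known without each file's frame count.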