elmuz opened this issue 1 year ago
Hello, thanks for the question.
It looks like MyPipeline is a function that constructs a DALI pipeline for you. Is that right? Could you share its code? It would be easier to pinpoint the problem if we knew more about your use case, especially the parameters you pass to the video reader op. Thanks!
Sure!
from pathlib import Path
from typing import Union

from nvidia.dali import fn, pipeline_def


@pipeline_def(num_threads=2, device_id=0)
def SpeechPipeline(sample_list_path: Union[str, Path], shuffle: bool = False):
    # GPU video reader: 5-frame sequences taken from the clips in the file list.
    frames, video_id, frame_num = fn.readers.video(
        name="speech_reader",
        device="gpu",
        file_list=f"{sample_list_path}",
        sequence_length=5,
        step=1,
        random_shuffle=shuffle,
        initial_fill=128,
        file_list_frame_num=True,
        enable_frame_num=True,
        file_list_include_preceding_frame=True,
    )
    return frames, video_id, frame_num
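For reference, a minimal sketch of how such a pipeline might be built and run (the batch size and the file list path below are illustrative placeholders):

# Minimal usage sketch; path and batch size are placeholders.
pipe = SpeechPipeline("/data/speech/file_list.txt", shuffle=True, batch_size=8)
pipe.build()
frames, video_id, frame_num = pipe.run()  # one batch of sequences + metadata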
Thanks.
This looks fine. Unfortunately, upon pipeline creation the DALI video reader needs to create an av context for each file to look up the number of frames and other metadata, if present. I think making this behavior better for a large number of files is a legitimate enhancement request. We are working on some improvements to the video reading capabilities, and we should definitely look into this. Another possible solution would be the ability to provide the necessary data to the reader, so that it can be calculated once and reused without the need to parse it during pipeline build.
One thing that comes to mind as a possible workaround now is to glue your videos together offline (using FFmpeg or something) into larger chunks and then use the sequence_length and step arguments to extract the same samples; a sketch follows below. This might somewhat improve the situation, but I am not really sure by how much without trying it.
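For illustration, one way to do the offline concatenation with FFmpeg's concat demuxer (a hedged sketch: the paths are placeholders, and stream copy assumes all clips share the same codec and encoding parameters):

import subprocess
from pathlib import Path

# Collect the clips to merge; the directory is a placeholder.
clips = sorted(Path("/data/speech/clips").glob("*.mp4"))

# The concat demuxer reads a text file with one "file '<path>'" line per clip.
list_file = Path("concat_list.txt")
list_file.write_text("".join(f"file '{c}'\n" for c in clips))

# -c copy remuxes without re-encoding, which only works if the clips
# were encoded with identical parameters.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_file),
     "-c", "copy", "merged_chunk.mp4"],
    check=True,
)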
I hope that helps.
Is pipeline building really independent of the number of workers used, and is there no parallelization here to speed it up? I did some scaling tests, building a video pipeline that reads 10,000 videos (~300 GB in total) on a system with 4 GPUs per node. These are the results:
n GPUs | Pipeline Building Time | Time per Epoch
---|---|---
4 | 639 s | 3900 s
8 | 644 s | 1920 s
16 | 643 s | 960 s
The PyTorch profiler tells me most of the time is spent here:
ncalls tottime percall cumtime percall filename:lineno(function)
1 609.193 609.193 609.193 609.193 {built-in method nvidia.dali.backend_impl.reader_meta}
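For context, a minimal sketch of how such a build-time measurement could be taken (the file list path and batch size are placeholders; SpeechPipeline is the @pipeline_def function shown earlier in the thread):

import time

pipe = SpeechPipeline("/data/file_list.txt", shuffle=True, batch_size=8)

start = time.perf_counter()
pipe.build()
print(f"Pipeline building time: {time.perf_counter() - start:.0f} s")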
Hi @olympiquemarcel,
Thank you for reaching out. Yes, currently the build process (the file discovery and the reading of each file's metadata) is single-threaded. The reader_meta method calls many native functions under the hood, so the Python-side profile doesn't provide the full picture.
Hi @JanuszL, thanks for the quick answer. As far as I can see, the output of the reader_meta method is quite a simple dict for each of the workers, in the form of
{"Reader": {"epoch_size": 2358372, "epoch_size_padded": 2358376, "number_of_shards": 16, "shard_id": 0, "pad_last_batch": 1, "stick_to_shard": 1}}
Would it be enough to save this dict once it has been created and then load it again to avoid the long building time?
Hi @olympiquemarcel,
Indeed, the output is simple, but to generate it (that is, to learn the number of samples in the dataset), DALI needs to open each file, create a libavutil context (which takes most of the time), and discover the number of frames in each video. Then, based on step, stride, and the sequence length, it can calculate the number of samples it can generate from each video; the sketch below illustrates that arithmetic.
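A sketch of that per-video calculation (illustrative only, not DALI's actual code):

def samples_per_video(num_frames, sequence_length, stride=1, step=1):
    # A sequence starting at frame s uses frames s, s + stride, ...,
    # s + (sequence_length - 1) * stride, so the last index must fit.
    last_start = num_frames - 1 - (sequence_length - 1) * stride
    if last_start < 0:
        return 0
    # Valid start points are 0, step, 2 * step, ..., up to last_start.
    return last_start // step + 1

# e.g. a 100-frame video with sequence_length=5, stride=1, step=1 -> 96 samples
print(samples_per_video(100, 5))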
Long story short, calling reader_meta makes the video loader fully initialize if it hasn't already. Subsequent calls to reader_meta should be much faster, as the reader has already been initialized (see the sketch below).
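To illustrate, a minimal sketch (the pipeline and reader name follow the snippet earlier in the thread; the path and batch size are placeholders):

import time

pipe = SpeechPipeline("/data/file_list.txt", batch_size=8)
pipe.build()

t0 = time.perf_counter()
pipe.reader_meta("speech_reader")  # first call may trigger full initialization
t1 = time.perf_counter()
pipe.reader_meta("speech_reader")  # reader already initialized, returns quickly
t2 = time.perf_counter()
print(f"first call: {t1 - t0:.2f} s, second call: {t2 - t1:.4f} s")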
Hello. I noticed that the time required for building the pipeline grows linearly with the number of samples in the video list. I am using more or less the code shared above.
Executing that snippet can take a long time when the number of videos is big; I am referring only to the building time. For example:
1000 samples (rows) in the txt file -> 12.81 sec
2000 samples (rows) in the txt file -> 24.52 sec
My dataset is far bigger than that (100x), so this linearly increasing setup time is not a viable solution.
Are these numbers expected in your experience? Maybe I am doing something wrong while configuring the operators... I am using DALI 1.20.0 from the official NVIDIA 22.12 container.
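For completeness, the file list referenced above is a plain text file with one "path label [start end]" entry per line; with file_list_frame_num=True the start/end columns are interpreted as frame numbers. The paths and values below are made up for illustration:

/videos/clip_0001.mp4 0 0 299
/videos/clip_0002.mp4 1 120 450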