Questions regarding design choices

Describe the question.

Hello everyone,

I have a question about some design choices when building a video dataset with DALI. My pipeline consists of several steps, some of which run inside DALI pipelines and some of which are plain Python code. Specifically, I have a WebDataset consisting of tar files that contain videos, so my first step is to invoke DALI's webdataset reader within a pipeline. Next, I would like to filter out unwanted video files based on their metadata before decoding. I then invoke a second DALI pipeline to decode the remaining video files. Finally, I process the decoded videos (e.g., cutting them up into smaller snippets) and forward those to another DALI pipeline for further processing (e.g., resizing). Dummy code looks something like this:

from nvidia.dali import fn, pipeline_def

@pipeline_def()
def wds_extraction(paths):
    raw_video_bytes = fn.readers.webdataset(paths=paths, ...)
    return raw_video_bytes

def filter(source):
    for video_bytes in source:
        duration, fps = get_metadata(video_bytes)
        ...
        yield video_bytes, duration, fps

@pipeline_def()
def decoding(source, device):
    inputs = fn.external_source(source, num_outputs=3) # bytes, duration, fps
    video = fn.experimental.decoders.video(inputs[0], device=device)
    return video, *inputs[1:] # simply forward duration and fps unchanged ...

def cutting_snippets(source):
    ...

@pipeline_def()
def resizing(source):
    data = fn.external_source(source, ...)
    ...

def iterator(paths):
    source = wds_extraction_iter(paths) # wraps the wds_extraction pipeline in a DALIRaggedIterator
    source = filter(source)
    source = decoding_iter(source) # wraps the decoding pipeline in a DALIRaggedIterator
    source = cutting_snippets(source)
    source = resizing_iter(source) # wraps the resizing pipeline in a DALIRaggedIterator
    yield from source

I wanted to ask whether this design is efficient despite the context switches between pure Python and DALI pipelines. Are there any performance disadvantages? Another annoyance is that I have to forward every piece of data through each DALI pipeline even when it is no longer modified. For example, I extract the duration and fps of each video in the filter function and want to pass them through to the end user. To do so, I have to feed them into the DALI pipelines and simply output them again unchanged.
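
To make the forwarding problem concrete, here is a minimal pure-Python sketch (no DALI involved) of a side-channel alternative: metadata is parked in a FIFO queue before a stage and re-attached afterwards, so the stage only ever sees the payload. The names `with_side_channel` and `fake_stage` are hypothetical, and the sketch assumes the stage yields exactly one output per input in the same order, which may not hold for a batching or shuffling DALI iterator.

```python
from collections import deque
from typing import Any, Callable, Iterable, Iterator, Tuple


def with_side_channel(
    source: Iterable[Tuple[Any, Any]],
    stage: Callable[[Iterator[Any]], Iterator[Any]],
) -> Iterator[Tuple[Any, Any]]:
    """Run `stage` on the payloads only and re-attach metadata afterwards.

    Assumption: `stage` yields one output per input, in input order.
    """
    parked = deque()  # metadata waiting for its payload to come back

    def payloads() -> Iterator[Any]:
        for payload, meta in source:
            parked.append(meta)  # park metadata before the stage sees the payload
            yield payload

    for result in stage(payloads()):
        # The oldest parked metadata belongs to the oldest unfinished payload.
        yield result, parked.popleft()


def fake_stage(items: Iterator[str]) -> Iterator[str]:
    # Hypothetical stand-in for a DALI iterator: "decodes" by upper-casing.
    for item in items:
        yield item.upper()


samples = [("a", {"fps": 30}), ("b", {"fps": 25})]
out = list(with_side_channel(samples, fake_stage))
```

Because the queue is FIFO, this should still pair results correctly if the stage reads ahead, as long as it neither drops nor reorders samples.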

Is there a better way to achieve a pipeline like this?

Check for duplicates

[x] I have searched the open bugs/issues and have found no duplicates for this bug report

NVIDIA / DALI

Questions regarding design choices #5626