NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Dali Loader using a NamedTuple datatype instead of an array #5539

Open rachelglenn opened 3 months ago

rachelglenn commented 3 months ago

Describe the question.

I am following the example for external input to the DALI loader. The data type going into my model is a NamedTuple. When I try to create the data loader with dataloader = DALIGenericIterator(pipeline, ["image"]), I get an error related to my NamedTuple type: TypeError: Illegal pipeline output type. The output 0 contains a nested DataNode. Missing list/tuple expansion (*) is the likely cause.

I am not sure how the DALI loader can accept a NamedTuple type. Is it possible? I am also not sure what to pass as the second argument when creating the DALIGenericIterator.

Thanks for the help.
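For context, the expansion that the error message refers to can be sketched as follows; this is an illustrative minimal example, not the original code, and dummy_source is a placeholder. When external_source is asked for several outputs, it returns a list of DataNodes, and set_outputs expects them unpacked rather than passed as a single nested list.

import numpy as np
from nvidia.dali import fn
from nvidia.dali.pipeline import Pipeline

def dummy_source():
    # placeholder source: yields one image batch and one label batch per iteration
    while True:
        yield [np.zeros((8, 8, 3), dtype=np.uint8)], [np.zeros((1,), dtype=np.uint8)]

pipe = Pipeline(batch_size=1, num_threads=1, device_id=0)
with pipe:
    # with num_outputs set, external_source returns a list of DataNodes
    outputs = fn.external_source(source=dummy_source(), num_outputs=2)
    pipe.set_outputs(*outputs)    # expanded: two separate pipeline outputs
    # pipe.set_outputs(outputs)   # not expanded: likely the "nested DataNode" error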


JanuszL commented 3 months ago

Hi @rachelglenn,

Thank you for reaching out. I'm afraid you may be hitting a DALI limitation; however, before we rule other issues out, please share a simple code snippet that we can run on our end to illustrate your approach and reproduce the problem.

rachelglenn commented 3 months ago

Here is what I can put together as an example. I hope I didn't make any small typos:


import cupy as cp
import imageio

class model_data(NamedTuple):
    image: torch.Tensor
    label: torch.Tensor
    filename: str

class ExternalInputGpuIterator(object):
    def __init__(self, batch_size):
        self.images_dir = "../../data/images/"
        self.batch_size = batch_size
        with open(self.images_dir + "file_list.txt", "r") as f:
            self.files = [line.rstrip() for line in f if line != ""]
        shuffle(self.files)

    def __iter__(self):
        self.i = 0
        self.n = len(self.files)
        return self

    def __next__(self):
        batch = []
        labels = []
        filenames = []
        for _ in range(self.batch_size):
            jpeg_filename, label = self.files[self.i].split(" ")
            im = imageio.imread(self.images_dir + jpeg_filename)
            im = cp.asarray(im)
            im = im * 0.6

            self.i = (self.i + 1) % self.n

            item = model_data(im.astype(cp.uint8), cp.array([label], dtype=cp.uint8), jpeg_filename)
            batch.append(item)
        return batch

eii_gpu = ExternalInputGpuIterator(batch_size)
pipe_gpu = Pipeline(batch_size=batch_size, num_threads=2, device_id=0)
with pipe_gpu:
    model_data = fn.external_source(source=eii_gpu, device="gpu")
    model_data.image = fn.brightness_contrast(model_data.image, contrast=2)
    pipe_gpu.set_outputs(model_data)
train_loader = DALIGenericIterator(pipe_gpu, ["model_data"])
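For comparison, here is a hedged sketch of one way the same data could be restructured so that each field becomes a separate, named pipeline output. The output names "image" and "label" are arbitrary choices, and the filenames are kept out of the pipeline here on the assumption that pipeline outputs are tensors rather than Python strings; none of this is from the original post.

from random import shuffle

import cupy as cp
import imageio
from nvidia.dali import fn
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator

class ExternalInputGpuIterator2:
    # illustrative variant that returns plain per-batch lists instead of NamedTuples
    def __init__(self, batch_size):
        self.images_dir = "../../data/images/"
        self.batch_size = batch_size
        with open(self.images_dir + "file_list.txt", "r") as f:
            self.files = [line.rstrip() for line in f if line != ""]
        shuffle(self.files)
        self.i = 0

    def __iter__(self):
        self.i = 0
        return self

    def __next__(self):
        images, labels = [], []
        for _ in range(self.batch_size):
            jpeg_filename, label = self.files[self.i].split(" ")
            im = cp.asarray(imageio.imread(self.images_dir + jpeg_filename))
            images.append(im.astype(cp.uint8))
            labels.append(cp.array([int(label)], dtype=cp.uint8))
            self.i = (self.i + 1) % len(self.files)
        # two lists -> two pipeline outputs when num_outputs=2 below
        return images, labels

batch_size = 4
pipe_gpu = Pipeline(batch_size=batch_size, num_threads=2, device_id=0)
with pipe_gpu:
    images, labels = fn.external_source(
        source=ExternalInputGpuIterator2(batch_size), num_outputs=2, device="gpu"
    )
    images = fn.brightness_contrast(images, contrast=2)
    pipe_gpu.set_outputs(images, labels)
pipe_gpu.build()

train_loader = DALIGenericIterator(pipe_gpu, ["image", "label"])

The model_data NamedTuple could then be rebuilt in the training loop from the per-GPU dicts that DALIGenericIterator yields, with the filenames tracked separately on the Python side.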
JanuszL commented 3 months ago

Hi @rachelglenn,

Thank you for providing the code snippet. However, I get multiple errors running it. Can you please check it on your end?

rachelglenn commented 3 months ago

Yes, I am not surprised; I have not been able to get it to work. This is why I am asking for help with how to use a NamedTuple as the data type for the pipeline. Can you provide an example using:

class model_data(NamedTuple):
    image: torch.Tensor
    label: torch.Tensor
    filename: str
JanuszL commented 3 months ago

@rachelglenn,

I get errors not related to the issue you raised, for example:

class model_data(NamedTuple):
NameError: name 'NamedTuple' is not defined

After adding:

from collections import namedtuple
import torch

I get:

class model_data(namedtuple):
TypeError: function() argument 'code' must be code, not str


I'm not sure I'm running the same code as you anymore. Please update the provided snippet so that it reproduces the error you mentioned.
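For what it's worth, both errors quoted above point at the import: typing.NamedTuple is the class meant to be subclassed with field annotations, while collections.namedtuple is a factory function and cannot be used directly as a base class, which produces the TypeError shown. A minimal sketch of the presumably intended definition:

from typing import NamedTuple

import torch

class model_data(NamedTuple):
    image: torch.Tensor
    label: torch.Tensor
    filename: str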