grez72 opened 2 weeks ago
Running multiple CUDA contexts (as happens when you run PyTorch data loaders in separate processes) will not give good performance. We are currently working on supporting free-threaded Python (https://docs.python.org/3/howto/free-threading-python.html), which will allow us to process samples from separate threads (not processes), sharing a single CUDA context.
We are also working on an alternative solution that doesn't require free-threaded Python and that will allow running multi-process data loaders while keeping the GPU-accelerated processing in a single process. We will let you know once we have something to test.
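In the meantime, one way to approximate that single-process-GPU pattern yourself is to have the workers return raw encoded bytes and decode each whole batch in the main process, e.g. through a custom `collate_fn`. The sketch below keeps the decoding callable pluggable so the pattern is clear; in practice it would be something like `nvimgcodec.Decoder().decode`, and the sample layout (a dict with an `'image'` key holding encoded bytes) is an assumption:

```python
def make_collate_fn(decode_batch):
    """Build a collate_fn that gathers raw encoded bytes from the
    DataLoader workers and decodes the whole batch in the main
    process, so only one process ever touches the CUDA context.

    `decode_batch` is any callable mapping a list of encoded byte
    strings to a list of decoded images -- for example (assumption)
    ``nvimgcodec.Decoder().decode``.
    """
    def collate(samples):
        # Workers only produced CPU-side bytes; the GPU decode
        # happens here, once per batch, in the main process.
        raw = [s["image"] for s in samples]
        images = decode_batch(raw)
        return [dict(s, image=img) for s, img in zip(samples, images)]
    return collate
```

You would pass the result as `collate_fn=` to `torch.utils.data.DataLoader` alongside `num_workers=N`; the workers then never construct a decoder at all.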
That being said, I believe it should not fail with cudaErrorInitializationError. My guess is that the decoder instance is being created at init time and then transferred to a separate process. Can you try moving the initialization of the decoder to its first use, so we are sure it gets initialized in each worker? Something like this:
```python
import nvimgcodec
from litdata import StreamingDataset  # assumed origin of StreamingDataset

class CustomDataSet(StreamingDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.decoder = None  # do not initialize here

    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        if self.decoder is None:
            # Created lazily, so each worker process builds its own decoder
            self.decoder = nvimgcodec.Decoder()
        sample['image'] = self.decoder(sample['image'])
        return sample
```
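The same lazy-initialization idea can be factored into a small reusable helper, so any non-picklable per-worker object is built on first use inside the worker instead of being pickled from the parent process. This is a generic sketch (the `LazyPerWorker` name and the pickling behavior are my own construction; in practice the factory would be `nvimgcodec.Decoder`):

```python
class LazyPerWorker:
    """Defer construction of a non-picklable object (e.g. a CUDA
    decoder) until first use, so each DataLoader worker builds its
    own instance instead of inheriting one from the parent process."""

    def __init__(self, factory):
        self._factory = factory  # e.g. nvimgcodec.Decoder (assumption)
        self._obj = None

    def __call__(self):
        # Build the object on first use, in whatever process calls us.
        if self._obj is None:
            self._obj = self._factory()
        return self._obj

    def __getstate__(self):
        # Drop the constructed object when the dataset is pickled to a
        # worker, so the worker re-creates it in its own CUDA context.
        return {"_factory": self._factory, "_obj": None}
```

A dataset would then hold `self.get_decoder = LazyPerWorker(nvimgcodec.Decoder)` and call `self.get_decoder()` inside `__getitem__`.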
Same question here, and the suggestion doesn't work. @jantonguirao
@grez72 @Harry-675 To investigate this further, I'd have to look at a full code sample. Can you provide a minimal reproduction script? Thanks
Describe the question.
Hi,
I'm hoping to integrate nvImageCodec with PyTorch DataLoaders (`torch.utils.data.DataLoader`, FFCV's DataLoader, or LitData's DataLoader), but I'm struggling.
If I include the decoder as a transform used in my dataset's `__getitem__` method, I get the dreaded cudaErrorInitializationError:

```
RuntimeError: Unhandled CUDA error: cudaErrorInitializationError initialization error
```

I can have my dataset return the raw image bytes and apply the decoder to the list of bytes, which is fast, but then I have to loop over the decoded items to turn them into PyTorch tensors, which is slow because it processes the entire batch sequentially (not in parallel workers). This single step is slow enough that it negates the advantage of using nvimgcodec.Decoder().
I also tried having my dataset return `DecodeSource` objects with ROIs, but that fails because `DecodeSource` is not pickleable.
In any case, I've checked the open bugs/issues and the docs, and I can't find a good example of using nvimgcodec in the context of a dataloader with parallel workers. Any guidance or suggestions for how to handle this would be greatly appreciated.