grez72 opened 2 weeks ago
Running multiple CUDA contexts (as happens when you run PyTorch data loaders in separate processes) will not give good performance. We are currently working on supporting free-threaded Python (https://docs.python.org/3/howto/free-threading-python.html), which will allow us to process samples from separate threads (not processes), sharing a single CUDA context.
We are also working on an alternative solution that doesn't require free-threaded Python and that will allow running multi-process data loaders while keeping the GPU-accelerated processing in a single process. We will let you know once we have something to test.
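In the meantime, one way to approximate that single-process-GPU pattern yourself is to have the workers return raw encoded bytes and decode each whole batch in the main process, e.g. through a custom `collate_fn`. The sketch below keeps the decoding callable pluggable so the pattern is clear; in practice it would be something like `nvimgcodec.Decoder().decode`, and the sample layout (a dict with an `'image'` key holding encoded bytes) is an assumption:

```python
def make_collate_fn(decode_batch):
    """Build a collate_fn that gathers raw encoded bytes from the
    DataLoader workers and decodes the whole batch in the main
    process, so only one process ever touches the CUDA context.

    `decode_batch` is any callable mapping a list of encoded byte
    strings to a list of decoded images -- for example (assumption)
    ``nvimgcodec.Decoder().decode``.
    """
    def collate(samples):
        # Workers only produced CPU-side bytes; the GPU decode
        # happens here, once per batch, in the main process.
        raw = [s["image"] for s in samples]
        images = decode_batch(raw)
        return [dict(s, image=img) for s, img in zip(samples, images)]
    return collate
```

You would pass the result as `collate_fn=` to `torch.utils.data.DataLoader` alongside `num_workers=N`; the workers then never construct a decoder at all.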
That being said, I believe it should not fail with cudaErrorInitializationError. My guess is that the decoder instance is being created at init time and then transferred to a separate process. Can you try moving the initialization of the decoder to its first use, so we are sure it gets initialized in each worker? Something like this:
```python
import nvimgcodec
from litdata import StreamingDataset  # assumed origin of StreamingDataset

class CustomDataSet(StreamingDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.decoder = None  # do not initialize here

    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        if self.decoder is None:
            # Created lazily, so each worker process builds its own decoder
            self.decoder = nvimgcodec.Decoder()
        sample['image'] = self.decoder(sample['image'])
        return sample
```
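The same lazy-initialization idea can be factored into a small reusable helper, so any non-picklable per-worker object is built on first use inside the worker instead of being pickled from the parent process. This is a generic sketch (the `LazyPerWorker` name and the pickling behavior are my own construction; in practice the factory would be `nvimgcodec.Decoder`):

```python
class LazyPerWorker:
    """Defer construction of a non-picklable object (e.g. a CUDA
    decoder) until first use, so each DataLoader worker builds its
    own instance instead of inheriting one from the parent process."""

    def __init__(self, factory):
        self._factory = factory  # e.g. nvimgcodec.Decoder (assumption)
        self._obj = None

    def __call__(self):
        # Build the object on first use, in whatever process calls us.
        if self._obj is None:
            self._obj = self._factory()
        return self._obj

    def __getstate__(self):
        # Drop the constructed object when the dataset is pickled to a
        # worker, so the worker re-creates it in its own CUDA context.
        return {"_factory": self._factory, "_obj": None}
```

A dataset would then hold `self.get_decoder = LazyPerWorker(nvimgcodec.Decoder)` and call `self.get_decoder()` inside `__getitem__`.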
Same question here, and the suggestion doesn't work. @jantonguirao
@grez72 @Harry-675 To investigate this further, I'd have to look at a full code sample. Can you provide a minimal reproduction script? Thanks
Describe the question.
Hi,
I'm hoping to integrate nvImageCodec with PyTorch DataLoaders (`torch.utils.data.DataLoader`, FFCV's DataLoader, or LitData's DataLoader), but I'm struggling.
If I include the decoder as a transform used in my dataset's `__getitem__` method, I get the dreaded cudaErrorInitializationError:

```
RuntimeError: Unhandled CUDA error: cudaErrorInitializationError initialization error
```

I can have my dataset return the raw image bytes and apply the decoder to the list of bytes, which is fast, but then I have to loop over the decoded items to turn them into PyTorch tensors, which is slow because it processes the entire batch sequentially (not in parallel workers). This single step is slow enough that it negates the advantage of using nvimgcodec.Decoder().
I also tried having my dataset return `DecodeSource` objects with ROIs, but that fails because `DecodeSource` is not pickleable.
In any case, I've checked the open bugs/issues and the docs, and I can't find a good example of using nvimgcodec in the context of a dataloader with parallel workers. Any guidance or suggestions for how to handle this would be greatly appreciated.