NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Error with GPU-only Image Decoding in NVIDIA DALI Pipeline #5697

Open aafaqin opened 2 weeks ago

aafaqin commented 2 weeks ago

Describe the question.

I’m encountering an issue while running a DALI pipeline with GPU-only decoding. The pipeline works when the fn.decoders.image operator is set to "mixed" mode, but it fails with device="gpu" mode, throwing an error about incompatible device storage for the input. Here’s the setup and error details:

Code:

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])

    def define_graph(self):
        self.jpegs, self.labels = self.input
        # This works:
        # self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)

        # This fails with incompatible device storage error:
        self.decode = fn.decoders.image(self.jpegs, device="gpu", output_type=types.RGB)
        self.resize = fn.resize(self.decode, device="gpu", resize_x=1120, resize_y=640)

        self.cmnp = fn.crop_mirror_normalize(
            self.resize, device="gpu", dtype=types.FLOAT, output_layout="CHW",
            crop=(640, 1120), mean=[0.0, 0.0, 0.0], std=[255.0, 255.0, 255.0]
        )
        return self.cmnp, self.labels

pipe = SimplePipeline(batch_size=batch_size, num_threads=32, device_id=0, external_data=iter)
pipe.build()

Error:

RuntimeError: Assert on "IsCompatibleDevice(dev, inp_dev, op_type)" failed: The input 0 for gpu operator nvidia.dali.fn.decoders.image is stored on incompatible device "cpu". Valid device is "gpu".

GPU and Platform Information:

GPU: NVIDIA RTX 6000 Ada Generation
CUDA Version: 12.2
DALI Version: [specify DALI version if known]
Driver Version: 535.104.05
System: Running in a Docker container with NVIDIA GPU support enabled

CUFile GDS Check: Here are the results from running gdscheck:


(base) ➜  tools ./gdscheck -p
warn: error opening log file: Permission denied, logging will be disabled
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 1024
 execution.max_request_parallelism : 0
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):65536, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded
(base) ➜  tools 

Additional Notes: The pipeline works when device="mixed" is used for fn.decoders.image, but switching to device="gpu" causes the error. I’m using external data for fn.external_source, which may be causing the device compatibility issue. The goal is to decode directly on the GPU to optimize performance.


JanuszL commented 2 weeks ago

Hi @aafaqin,

Thank you for reaching out. You can read more about the meaning of the operator backend here. The mixed backend is used for operators that consume input from the CPU and produce output on the GPU. The image decoder only supports the cpu and mixed backends, so the encoded images have to be located in CPU memory first; there is no variant that can read the encoded data directly from GPU memory. The rationale is that while most of the decoding process can be performed on the GPU, the initial stages either need to happen on the CPU or are simply more efficient there (bitstream parsing, Huffman coefficient decoding).
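As an illustration of that contract, here is a minimal sketch (not from the thread — the iterator, batch size, and "sample.jpg" file are placeholder assumptions) of feeding CPU-resident encoded buffers into the mixed decoder, written with the modern pipeline_def API:

```python
# Sketch only: assumes nvidia-dali is installed, a CUDA GPU is visible,
# and "sample.jpg" is a placeholder JPEG file on disk.
import numpy as np
from nvidia.dali import pipeline_def, fn, types

def jpeg_batches(batch_size=8):
    # Yields (encoded_images, labels). The encoded bytes stay in CPU
    # memory, which is exactly what the mixed decoder expects as input.
    while True:
        encoded = [np.fromfile("sample.jpg", dtype=np.uint8)] * batch_size
        labels = [np.int32(0)] * batch_size
        yield encoded, labels

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def decode_pipe():
    jpegs, labels = fn.external_source(
        source=jpeg_batches(), num_outputs=2,
        dtype=[types.UINT8, types.INT32])
    # "mixed" backend: CPU input (encoded bytes) -> GPU output (decoded RGB).
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    return images, labels

pipe = decode_pipe()
pipe.build()
images, labels = pipe.run()  # the decoded images now live in GPU memory
```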

aafaqin commented 2 weeks ago

Hi @JanuszL ,

Thank you for the clarification. I understand now that the mixed mode is essential for handling initial decoding stages on the CPU before GPU processing can take place. Given our intent to enhance performance through GPU utilization, I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow. Could this approach mitigate the need for CPU involvement in the initial decoding steps, or would it be feasible to adjust the pipeline to support such a configuration?

Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?

Looking forward to your insights.

Best regards

JanuszL commented 2 weeks ago

Hi @aafaqin,

I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow.

I'm afraid this is not currently possible, as the decoding process requires some work to happen on the CPU first (stream parsing and, in the case of the hybrid approach, i.e. when the HW decoder is not used, Huffman coefficient decoding).

Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?

DALI hasn't approached encoding yet. Technically it should be feasible; however, I'm not sure whether the encoded images would end up in CPU or GPU memory. You may try using nvImageCodec for the encoding and kvikio for GDS access.
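One possible shape for that experiment — purely a sketch, not a verified recipe: it assumes the nvidia-nvimgcodec, kvikio, and cupy packages plus a CUDA GPU, and that the encoded stream comes back as host bytes (where the bytes actually land is exactly the open question above, so the staging copy below is hypothetical):

```python
# Sketch only: assumes nvidia-nvimgcodec, kvikio, cupy and a CUDA GPU.
import numpy as np
import cupy as cp
import kvikio
from nvidia import nvimgcodec

# A placeholder GPU-resident RGB image (HWC, uint8).
image_gpu = cp.zeros((480, 640, 3), dtype=cp.uint8)

# Encode to JPEG on the GPU with nvImageCodec.
encoder = nvimgcodec.Encoder()
jpeg_bytes = encoder.encode(nvimgcodec.as_image(image_gpu), "jpeg")

# Hypothetical staging step: if the encoded stream can be kept (or copied)
# in GPU memory, kvikio/cuFile can write it to disk via GDS.
gpu_buf = cp.asarray(np.frombuffer(jpeg_bytes, dtype=np.uint8))
f = kvikio.CuFile("out.jpg", "w")
try:
    f.write(gpu_buf)
finally:
    f.close()
```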

aafaqin commented 1 week ago

Thanks for the help so far. On the same code I am trying out different variants, for example:

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        # self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])
        self.input = fn.external_source(source=external_data, num_outputs=2,
                                        dtype=[types.UINT8, types.INT32],
                                        parallel=True, prefetch_queue_depth=16, batch=True)

    def define_graph(self):
        self.jpegs, self.labels = self.input

        self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)

        self.resize = fn.resize(self.decode, device="gpu", resize_x=1600, resize_y=1600)
        # self.prem = fn.transpose(self.resize, perm=[2, 0, 1], dtype=types.FLOAT)

        self.cmnp = fn.crop_mirror_normalize(self.resize, device="gpu",
                                             dtype=types.FLOAT,
                                             output_layout="CHW",
                                             crop=(1600, 1600),
                                             mean=[0.0, 0.0, 0.0],
                                             std=[255.0, 255.0, 255.0])

        return self.cmnp, self.labels

Still only one CPU core is being used (100% utilization); I have a 64-core CPU. How do I spread the work across cores?

JanuszL commented 1 week ago

Hi @aafaqin,

Still only one CPU core is being used (100% utilization); I have a 64-core CPU. How do I spread the work across cores?

That suggests you are using only one DALI thread (check the num_threads value) or a batch size of 1. Can you set, for example, num_threads=10 and batch_size=256 and see if that makes any difference?
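Both knobs are set where the pipeline object is constructed; a sketch with illustrative values, reusing the SimplePipeline class from the snippet above (external_data is a placeholder for the actual source):

```python
# Sketch: illustrative values only, using the SimplePipeline class from above.
pipe = SimplePipeline(
    batch_size=256,   # more samples per iteration for the thread pool to chew on
    num_threads=10,   # size of DALI's CPU worker thread pool
    device_id=0,
    external_data=external_data,
)
pipe.build()
```

Note also that parallel=True on fn.external_source runs the source in separate worker processes whose count is a different knob again: the Pipeline constructor's py_num_workers argument, which defaults to 1, so it would need raising for the source itself to use multiple cores.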

aafaqin commented 1 week ago

I've set the num_threads in the DALI pipeline to match the number of CPU cores (64 in my case) and verified the DALI_AFFINITY_MASK. Despite this, I am not seeing any significant performance improvement when increasing the batch size. The average processing speed per image remains unchanged, regardless of adjustments to the batch size.

Do you have any insights on what could be causing this bottleneck? Could it be related to how external inputs are being processed or perhaps the GPU-CPU synchronization? Any suggestions to optimize this further would be greatly appreciated.

JanuszL commented 1 week ago

Can you try capturing a profile of the processing with Nsight, see how it looks, and share it?