Open aafaqin opened 2 weeks ago
Hi @aafaqin,
Thank you for reaching out.
You can read more about the meaning of the operator backend here. The mixed backend is used for operators that consume input from the CPU and produce output on the GPU. The decoder operator supports only the cpu and mixed backends, so the encoded images must be located on the CPU first. There is no variant that can read data directly from GPU memory. The rationale is that while most of the decoding process can be performed on the GPU, there are initial stages that either need to happen on the CPU or are simply more efficient there (bitstream parsing, Huffman coefficient decoding).
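To illustrate why bitstream parsing tends to stay on the CPU, here is a minimal sketch of walking the header segments of a JPEG stream. It is not DALI or nvJPEG code; the function name `list_jpeg_segments` is made up for this example. The point it shows is that the position of each segment is known only after reading the length of the previous one, so the walk is inherently sequential and branch-heavy rather than data-parallel.

```python
import struct

# Markers that stand alone and carry no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def list_jpeg_segments(data: bytes):
    """Walk JPEG header segments up to Start of Scan.

    Returns a list of (marker, payload_length) pairs. Hypothetical helper,
    written only to show the sequential nature of header parsing.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    segments = [(0xD8, 0)]
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in STANDALONE:
            segments.append((marker, 0))
            pos += 2
            continue
        # The 16-bit big-endian length includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((marker, length - 2))
        if marker == 0xDA:  # SOS: entropy-coded (Huffman) data follows
            break
        pos += 2 + length  # the next offset depends on this segment's length
    return segments
```

The entropy-coded data after the SOS marker has a similar property: Huffman codes are variable-length, so the start of one symbol is known only after the previous one has been decoded.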
Hi @JanuszL ,
Thank you for the clarification. I understand now that the mixed mode is essential for handling initial decoding stages on the CPU before GPU processing can take place. Given our intent to enhance performance through GPU utilization, I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow. Could this approach mitigate the need for CPU involvement in the initial decoding steps, or would it be feasible to adjust the pipeline to support such a configuration?
Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?
Looking forward to your insights.
Best regards
Hi @aafaqin,
I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow.
I'm afraid this is not currently possible, as the decoding process requires some work to happen on the CPU first (stream parsing and, when the hybrid approach is used rather than HW decoding, Huffman coefficient decoding).
Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?
DALI hasn't approached encoding yet. Technically it should be feasible; however, I'm not sure whether the encoded images end up in CPU or GPU memory. You may try using nvImageCodec for decoding and kvikio for GDS access.
Thanks for the help so far. On the same code, I am trying out different approaches, like:
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        # self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])
        # Two outputs per batch: encoded JPEG bytes and labels.
        self.input = fn.external_source(source=external_data, num_outputs=2,
                                        dtype=[types.UINT8, types.INT32],
                                        parallel=True, prefetch_queue_depth=16, batch=True)

    def define_graph(self):
        self.jpegs, self.labels = self.input
        # "mixed": encoded input on the CPU, decoded output on the GPU.
        self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)
        self.resize = fn.resize(self.decode, device="gpu", resize_x=1600, resize_y=1600)
        # self.prem = fn.transpose(self.resize, perm=[2,0,1], dtype=types.FLOAT)
        # Scale to [0, 1] and convert the layout to CHW.
        self.cmnp = fn.crop_mirror_normalize(self.resize, device="gpu",
                                             dtype=types.FLOAT,
                                             output_layout="CHW",
                                             crop=(1600, 1600),
                                             mean=[0.0, 0.0, 0.0],
                                             std=[255.0, 255.0, 255.0])
        return self.cmnp, self.labels
Still, only 1 CPU core is being used (at 100% utilisation). I have a 64-core CPU; how can I spread the load across the cores?
Hi @aafaqin,
Still, only 1 CPU core is being used (at 100% utilisation). I have a 64-core CPU; how can I spread the load across the cores?
It means you use only 1 DALI thread (see the num_threads value) or the batch size is 1. Can you set num_threads=10 and a batch size of 256, for example, and see if that makes any difference?
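The reasoning behind that suggestion can be shown with a toy model. The sketch below is not DALI's thread pool; it uses Python's `ThreadPoolExecutor` and a made-up helper, `peak_concurrency`, and it assumes that per-sample CPU work is handed out to worker threads one sample at a time. Under that assumption, the number of busy threads can never exceed the smaller of the thread count and the batch size.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_threads: int, batch_size: int, work_seconds: float = 0.02) -> int:
    """Return the highest number of samples observed in flight at once.

    Hypothetical helper for illustration; time.sleep stands in for the
    per-sample CPU work.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def process_sample(_):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(work_seconds)  # stand-in for per-sample work
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # One task per sample in the batch.
        list(pool.map(process_sample, range(batch_size)))
    return peak


if __name__ == "__main__":
    print(peak_concurrency(num_threads=64, batch_size=1))   # one sample, so one busy thread
    print(peak_concurrency(num_threads=10, batch_size=256))  # bounded by the 10 threads
```

With a batch size of 1, adding threads cannot help in this model, which is why both values are worth raising together.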
I've set the num_threads in the DALI pipeline to match the number of CPU cores (64 in my case) and verified the DALI_AFFINITY_MASK. Despite this, I am not seeing any significant performance improvement when increasing the batch size. The average processing speed per image remains unchanged, regardless of adjustments to the batch size.
Do you have any insights on what could be causing this bottleneck? Could it be related to how external inputs are being processed or perhaps the GPU-CPU synchronization? Any suggestions to optimize this further would be greatly appreciated.
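One way to check whether the external input is the limiting factor is to time the source on its own, outside the pipeline. The sketch below is a generic stdlib harness, not a DALI API; `time_source` is a made-up name, and `source` is assumed to be a zero-argument callable that returns one batch per call.

```python
import time


def time_source(source, num_batches: int):
    """Call `source` num_batches times; return (total_seconds, batches_per_second).

    Hypothetical helper: `source` is assumed to be a zero-argument callable
    that produces one batch each time it is called.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        source()
    elapsed = time.perf_counter() - start
    rate = num_batches / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate
```

If the source alone cannot deliver batches faster than the rate observed for the whole pipeline, the Python-side loading is the likely bottleneck, and a single busy core would be consistent with that. As far as the DALI documentation describes it, `parallel=True` in `fn.external_source` runs the source in separate Python worker processes whose count is set by the pipeline's `py_num_workers` argument, so that value may be worth checking as well.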
Describe the question.
I’m encountering an issue while running a DALI pipeline with GPU-only decoding. The pipeline works when the fn.decoders.image operator is set to "mixed" mode, but it fails with device="gpu" mode, throwing an error about incompatible device storage for the input. Here’s the setup and error details:
Code:
Error:
RuntimeError: Assert on "IsCompatibleDevice(dev, inp_dev, op_type)" failed: The input 0 for gpu operator nvidia.dali.fn.decoders.image is stored on incompatible device "cpu". Valid device is "gpu".
GPU and Platform Information:
CUFile GDS Check: Here are the results from running gdscheck:
Additional Notes: The pipeline works when device="mixed" is used for fn.decoders.image, but switching to device="gpu" causes the error. I’m using external data for fn.external_source, which may be causing the device compatibility issue. The goal is to decode directly on the GPU to optimize performance.