NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

A100 hardware decoder #5362

Open dengjiahao12 opened 4 months ago

dengjiahao12 commented 4 months ago

Describe the question.

A100 hardware decoder: I pulled the nvcr.io/nvidia/pytorch:23.12-py3 Docker image and created a container. I built the following pipeline:

images, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
images = fn.decoders.image(images, device='mixed', output_type=types.RGB, hw_decoder_load=0.75)
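
For reference, a minimal runnable version of that pipeline could look like the sketch below (image_dir, batch size, thread count, and device id are placeholders, not my actual configuration):

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

image_dir = "/data/images"  # placeholder: a directory of JPEG files

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def decode_pipeline(hw_decoder_load):
    # Read raw JPEG bytes, then decode with device='mixed'; hw_decoder_load
    # controls how much of the decoding is sent to the A100 hardware JPEG
    # decoder versus the CUDA-based decoder.
    jpegs, _ = fn.readers.file(file_root=image_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device='mixed',
                               output_type=types.RGB,
                               hw_decoder_load=hw_decoder_load)
    return images

pipe = decode_pipeline(hw_decoder_load=0.75)
pipe.build()
images, = pipe.run()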

I want to test the A100 hardware decoder and analyze why there are significant throughput differences when different ratios of the decoding work are assigned to the hardware decoder. My naive approach was to insert pdb breakpoints and step through the program, but when execution reaches self._pipe.RunGPU() in nvidia/dali/pipeline.py, I cannot step into RunGPU().

What I want to know is how to analyze why the throughput differs when different ratios of the decoding work are assigned to the hardware decoder. For example, according to the blog Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs, if 75% of the decoding work is assigned to the hardware decoder, the throughput reaches about 7000 img/sec.

However, if all of the work is assigned to the hardware decoder, the throughput is only about 5000 img/sec, and if all decoding is assigned to the A100 GPU, the throughput is about 6000 img/sec. In my own test, when I assigned 10% of the decoding work to the hardware decoder (hw_decoder_load=0.1), the throughput was only 2000 img/sec. I would like to understand why this happens and how to analyze it.
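
A simple way to compare the settings is a sweep along the lines of the sketch below (it builds on the pipeline sketch above; the warm-up and iteration counts are arbitrary choices, and the absolute img/sec numbers depend on the dataset and storage):

import time

def measure_throughput(hw_decoder_load, iterations=100, batch_size=256):
    # batch_size overrides the value set in the @pipeline_def decorator
    pipe = decode_pipeline(hw_decoder_load=hw_decoder_load, batch_size=batch_size)
    pipe.build()
    # warm up so allocations and file-cache effects don't skew the timing
    for _ in range(10):
        pipe.run()
    start = time.perf_counter()
    for _ in range(iterations):
        pipe.run()
    elapsed = time.perf_counter() - start
    return iterations * batch_size / elapsed

for load in (0.0, 0.1, 0.5, 0.75, 1.0):
    print(f"hw_decoder_load={load}: {measure_throughput(load):.0f} img/sec")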


jantonguirao commented 4 months ago

Hi @dengjiahao12. Thank you for your question.

The best tool I can recommend is the Nsight Systems profiler. You can collect a profile like this:

# (optional) lower the paranoid level for profiling (this gives us some extra info for the CPU part of the execution)
echo 1 > /proc/sys/kernel/perf_event_paranoid

# collect your profile with nsys from the CUDA toolkit
nsys profile --trace=cuda,opengl,nvtx python your_test_script.py

This should produce a profile file that you can load and visualize to see the timeline of your execution. You need to install Nsight Systems (https://developer.nvidia.com/nsight-systems) to open it.
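
If it is hard to locate the relevant part of the timeline, you can additionally wrap the sections you care about in NVTX ranges, which the --trace=nvtx option above will pick up. A possible sketch using the nvtx Python package (which may need to be installed separately, e.g. pip install nvtx):

import nvtx

# 'pipe' is assumed to be an already built DALI pipeline, e.g. from your snippet
with nvtx.annotate("warmup"):
    for _ in range(10):
        pipe.run()

with nvtx.annotate("hw_decoder_load=0.75"):
    for _ in range(100):
        pipe.run()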

Feel free to send the profile back to us and we can have a look and help you figure out what's going on in your case.

Hope that helps.