Open dengjiahao12 opened 4 months ago
Hi @dengjiahao12. Thank you for your question.
The best tool I can recommend is the Nsight Systems profiler. You can collect your profile like this:
# (optional) lower the perf_event_paranoid level for profiling (this gives us some extra info for the CPU part of the execution; writing to this file requires root)
echo 1 > /proc/sys/kernel/perf_event_paranoid
# collect your profile with nsys from cuda toolkit
nsys profile --trace=cuda,opengl,nvtx python your_test_script.py
This should give you a profile file that you can load and visualize to see the timeline of your execution. To open it, you need to install Nsight Systems (https://developer.nvidia.com/nsight-systems).
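Since the command above already traces NVTX, it can help to mark each iteration of the test script with an NVTX range and to record images per second in the same run, so that throughput numbers can be matched to regions on the timeline. The following is only a minimal sketch: `nvtx_range` and `measure_throughput` are hypothetical helper names, `run_iteration` stands for whatever processes one batch in your script (for example a wrapper around the pipeline's `run()`), and it assumes the `torch.cuda.nvtx` module from the PyTorch container is available, falling back to a no-op otherwise.

```python
import time
from contextlib import contextmanager

try:
    # torch ships in the nvcr.io/nvidia/pytorch container; NVTX needs a CUDA build.
    import torch.cuda.nvtx as _nvtx
    _nvtx.range_push("nvtx_probe")  # probe once so a CPU-only build fails here
    _nvtx.range_pop()
except Exception:
    _nvtx = None  # fall back to a no-op so the script still runs without CUDA


@contextmanager
def nvtx_range(name):
    """Mark a named region for the NVTX row of the timeline (no-op without CUDA)."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        if _nvtx is not None:
            _nvtx.range_pop()


def measure_throughput(run_iteration, batch_size, iterations=100, warmup=10):
    """Return images/sec for `run_iteration`, a zero-argument callable
    that processes exactly one batch of `batch_size` images."""
    # Warm-up iterations are excluded from timing (buffers, caches, lazy init).
    for _ in range(warmup):
        run_iteration()
    start = time.perf_counter()
    for i in range(iterations):
        with nvtx_range(f"iteration_{i}"):
            run_iteration()
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call that runs one batch
    # of your own pipeline.
    rate = measure_throughput(lambda: time.sleep(0.001), batch_size=64,
                              iterations=20, warmup=2)
    print(f"{rate:.0f} img/sec")
```

The wall-clock number is only meaningful if `run_iteration` returns after the batch is actually produced; if your loop only schedules work, the profile timeline is the more reliable source.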
Feel free to send us the profile back and we can have a look and help you figure out what's going on in your case.
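Because the question is about different hardware decoder ratios, it may be easiest to collect one profile per `hw_decoder_load` value and compare the timelines side by side. The sketch below only builds the command lines. It assumes your test script accepts a `--hw-decoder-load` argument (a hypothetical flag you would need to add yourself), the report names are made up for illustration, and `--output` and `--force-overwrite` are `nsys profile` options to the best of my knowledge, so please check `nsys profile --help` in your container.

```python
import shlex


def build_nsys_command(script, hw_decoder_load,
                       traces=("cuda", "opengl", "nvtx")):
    """Build an `nsys profile` command line for one hw_decoder_load setting."""
    # Report name encodes the ratio as a percentage, e.g. 0.75 -> dali_hw_load_075
    report = f"dali_hw_load_{int(round(hw_decoder_load * 100)):03d}"
    return [
        "nsys", "profile",
        f"--trace={','.join(traces)}",
        f"--output={report}",
        "--force-overwrite=true",
        "python", script,
        f"--hw-decoder-load={hw_decoder_load}",  # hypothetical script flag
    ]


if __name__ == "__main__":
    for load in (0.0, 0.1, 0.75, 1.0):
        cmd = build_nsys_command("your_test_script.py", load)
        print(shlex.join(cmd))
        # To actually run it inside the container:
        # import subprocess; subprocess.run(cmd, check=True)
```

Keeping the batch size, dataset and iteration count identical across the runs makes the resulting profiles directly comparable.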
Hope that helps.
Describe the question.
A100 hardware decoder: I pulled the nvcr.io/nvidia/pytorch:23.12-py3 Docker image and created a container. I built the following pipeline.
I want to test the A100 hardware decoder and analyze why there are significant throughput differences when allocating different ratios of decoding tasks to the hardware decoder. My naive approach was to add pdb to inspect program execution. However, when the program reaches self._pipe.RunGPU() in nvidia/dali/pipeline.py, I can't step into RunGPU().
What I want to know is how to analyze why there is a difference in throughput when different ratios of decoding tasks are assigned to the hardware decoder. For example, according to the blog "Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs", if 75% of the decoding tasks are assigned to the hardware decoder, the throughput can reach about 7000 img/sec. However, if all tasks are assigned to the hardware decoder, the throughput is only about 5000 img/sec. If all decoding is assigned to the A100 GPU, the throughput is about 6000 img/sec. In my own test, when I assigned 10% of the decoding tasks to the hardware decoder (hw_decoder_load=0.1), the throughput was only 2000 img/sec. I want to know why and how to analyze why this is the case.
Check for duplicates