NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Profiling NVIDIA DALI Code using Nsight #2790

Closed · dariodematties closed this issue 3 years ago

dariodematties commented 3 years ago

Dear Community, I am using DALI for my postdoctoral project.

I am running on multiple GPUs with MPI.

I have 24 nodes with 8 GPUs each, but for the moment I am using just one node, since I want to make my code as efficient as possible before pursuing large runs.

I don't think I am using DALI in a conventional way, since I do not load and augment the whole dataset before feeding it to a network.

Instead, I fetch a batch with one pipeline and produce several augmentations of that batch with another pipeline while feeding the augmented images to a ResNet.

It is more of an interactive execution between the network and DALI.

I have noticed irregular behavior in my code: sometimes it runs very fast and sometimes it is really slow.

That's why I want to profile it.

I am asking whether you would be interested in helping me profile my code.

Nsight produces a file per process, which I copy to my local machine and analyze with the NVIDIA Nsight Systems visualization and analysis tool.

Maybe you are interested in analyzing that information and could help me find the bottlenecks in my code.

Thanks!

JanuszL commented 3 years ago

Hi, I would start with checking how the DALI threads behave (all DALI traces are assigned to a dedicated domain, so you can easily spot them in the viewer). You should check how long each DALI pipeline stage lasts and how that compares to the training itself. Please also check how the processing compares across the different GPUs: one GPU may have to process a very big input, and because of that all the other GPUs sit idle waiting for it. If you like, you can share profiles/screenshots for reference.

dariodematties commented 3 years ago

Thank you very much @JanuszL! I am attaching a profile for rank 0. If you can, please tell me your first impression. myprofile_rank_0.qdrep.zip

JanuszL commented 3 years ago

Hi, Looking into the profiles I see that there might be non-negligible profiling overhead. You may want to run the profiler with --trace=cuda,opengl,nvtx. The profile itself shows that most of the time is spent on the CPU side, on data reading:

[screenshot of the profile timeline]

Maybe you are blocked by the IO, as the FileReader doesn't do any significant processing inside.
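For reference, an invocation with that reduced trace set might look like the sketch below; the script name and the Open MPI rank variable used in the output name are assumptions, so adjust them to your launcher and training script.

```bash
# Sketch: one .qdrep report per MPI rank, tracing only CUDA, OpenGL and NVTX
# to cut profiling overhead. "train.py" and OMPI_COMM_WORLD_RANK (Open MPI)
# are assumptions; use the rank variable of your MPI implementation.
mpirun -np 8 \
    nsys profile --trace=cuda,opengl,nvtx \
        -o myprofile_rank_%q{OMPI_COMM_WORLD_RANK} \
        python train.py
```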

dariodematties commented 3 years ago

Thank you for your analysis @JanuszL. Maybe it is because I am using DALI in an interactive way, as I explained above.

This sort of interactive implementation is really important for future development, and I needed to know whether I could use DALI that way. Still, at this stage I can implement it differently: for instance, loading and augmenting the whole dataset at once before processing, or prefetching the next batch while the network is still processing the previous one.

Do you think my specific implementation could be affecting performance?

What are the good practices for using DALI to get good performance?

Thanks!

JanuszL commented 3 years ago

@dariodematties,

What I see is that the data reading takes a couple of seconds each iteration, while the other work, including inference, takes just a couple of milliseconds. I don't think that any overlap would matter here. I would check why the data loading itself is so slow. Are you using network storage, or do you have a slow disc?

JanuszL commented 3 years ago

BTW - I'm not sure if I have already asked, but have you considered using Triton? There is a dedicated DALI backend for it which could simplify your workflow.

dariodematties commented 3 years ago

Thank you very much @JanuszL, I didn't know about Triton, thanks! I will take a look. I will analyze what you say, but I asked cluster support and, yes, I am using network storage. Could this be the reason behind this profiling result? I am also seeing a weird behavior: in some very unusual cases the code runs VERY fast, while for most of the runs it is slow. In another code, for pretraining, I notice that the first epoch runs very slow and then, from the second epoch onward, it starts to run much faster. The difference between the faster and slower runs is a factor of 20!!

JanuszL commented 3 years ago

Hi, You can try to change the value of:

dont_use_mmap (bool, optional, default = False) –

If set to True, the Loader will use plain file I/O instead of trying to map the file in memory.

Mapping provides a small performance benefit when accessing a local file system, but most network file systems do not provide optimum performance.

and see if that changes anything. Network storage could be the reason - there could be different caching and optimization strategies.
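If it helps, here is a minimal sketch of passing that flag, assuming a pipeline built with the fn.readers.file API (the directory layout and pipeline parameters below are placeholders; the older ops.FileReader accepts the same argument):

```python
# Minimal sketch (assumed paths/parameters): pass dont_use_mmap=True so the
# reader uses plain file I/O instead of memory-mapping the files, which tends
# to behave better on network file systems.
from nvidia.dali import pipeline_def, fn

@pipeline_def
def reader_pipeline(data_dir):
    jpegs, labels = fn.readers.file(
        file_root=data_dir,
        random_shuffle=True,
        dont_use_mmap=True,  # plain reads instead of mmap
        name="Reader",
    )
    images = fn.decoders.image(jpegs, device="mixed")  # decode on the GPU
    return images, labels

# Example build (placeholder values):
# pipe = reader_pipeline("/path/to/imagenet/train",
#                        batch_size=64, num_threads=4, device_id=0)
# pipe.build()
```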

In another code, for pretraining, I notice that the first epoch runs very slow and then, from the second epoch onward, it starts to run much faster.

That sounds like the difference between a cold disc cache, when data is read directly from the disc/storage, and a hot disc cache, when the data is already in RAM and there is no direct read from the disc. I don't think DALI can do much about the IO and the storage access speed itself. You can also capture a profile after the first epoch to see how each stage of your flow looks when the IO is not the main blocker.

dariodematties commented 3 years ago

Hi @JanuszL, excuse my delay, I am stuck with some other tasks at the moment. This is just to tell you that I will return ASAP. Please do not close the issue.

Thank you for your help!

dariodematties commented 3 years ago

Hi again @JanuszL, thank you very much for your suggestion. I tested it and it improved things a lot. Basically, the training of a logistic regression on top of the ResNet showed behavior similar to the contrastive pretraining process: slow in the first epoch and really fast afterwards.

Please look at the speed difference between the first and the subsequent epochs!

[screenshot: epoch timings showing the first epoch much slower than the subsequent ones]

Sometimes it runs fast from the very first epoch (epoch 0), but that behavior is unusual:

[screenshot: epoch timings showing a fast first epoch]

I tried to profile it, but since I have to process at least two epochs in order to capture at least one fast epoch, the output profiling file is enormous and I do not have enough RAM to open it on my local laptop (just 8 GB).

I would attach the output file here if it were not 241 MB.

Another point I want to share: I asked the cluster support team, and I have a 14 TB SSD scratch disk per node, but copying ImageNet from the network-accessible disk to the scratch disks takes forever (after 4 hours of copying I decided to abort the command).

Isn't there a mechanism in DALI to take advantage of such disks? I was thinking that maybe DALI could move the dataset to scratch while processing the first epoch.

JanuszL commented 3 years ago

Hi @dariodematties,

Isn't there a mechanism in DALI to take advantage of such disks? I was thinking that maybe DALI could move the dataset to scratch while processing the first epoch.

I don't think we have ever considered such a use case. The first solution that comes to my mind is to implement a custom Python reader using ExternalSource. In 1.0 (a preview is available in the nightly/weekly builds) we also added the ability to parallelize the processing and to fetch data ahead; please check the parallel and prefetch_queue_depth parameters. What you would do is try to load the data from the local disc; if it is not there yet, fetch it from the network storage and save it to the local disc.
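A rough sketch of what such a caching reader could look like with a parallel external source; the paths, file-list handling, and label logic below are assumptions for illustration, not a tested implementation:

```python
# Sketch: per-sample callable for fn.external_source that reads from a local
# scratch copy when it exists, and otherwise fetches the file from network
# storage and caches it on the scratch disk. NETWORK_ROOT/SCRATCH_ROOT and the
# (relative_path, label) file list are assumptions.
import os
import shutil
import numpy as np
from nvidia.dali import pipeline_def, fn

NETWORK_ROOT = "/net/imagenet"      # assumed network-storage location
SCRATCH_ROOT = "/scratch/imagenet"  # assumed local SSD scratch location

class CachingSource:
    """Callable source: SampleInfo -> (encoded_image, label)."""
    def __init__(self, file_list):
        self.files = file_list  # list of (relative_path, label) pairs

    def __call__(self, sample_info):
        if sample_info.idx_in_epoch >= len(self.files):
            raise StopIteration  # end of epoch
        rel_path, label = self.files[sample_info.idx_in_epoch]
        local_path = os.path.join(SCRATCH_ROOT, rel_path)
        if not os.path.exists(local_path):
            # First time this sample is touched: copy it to the scratch disk.
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            shutil.copyfile(os.path.join(NETWORK_ROOT, rel_path), local_path)
        with open(local_path, "rb") as f:
            encoded = np.frombuffer(f.read(), dtype=np.uint8)
        return encoded, np.array(label, dtype=np.int64)

@pipeline_def
def cached_pipeline(file_list):
    jpegs, labels = fn.external_source(
        source=CachingSource(file_list),
        num_outputs=2,
        batch=False,    # the callable returns one sample at a time
        parallel=True,  # run the source in Python worker processes
    )
    images = fn.decoders.image(jpegs, device="mixed")
    return images, labels

# Example build (placeholder values); py_num_workers and prefetch_queue_depth
# control the parallel workers and how far ahead data is fetched:
# pipe = cached_pipeline(file_list, batch_size=64, num_threads=4, device_id=0,
#                        py_num_workers=4, prefetch_queue_depth=2)
```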

dariodematties commented 3 years ago

Excuse my delay @JanuszL, and thank you very much for your help. Not right now, but in the near future, I will surely try to use those SSDs through the strategy you suggested. Thanks!