NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Trouble with multithreading #1812

Closed · jwitos closed this issue 3 years ago

jwitos commented 4 years ago

Hi,

Thanks a lot for your work on the project! I've been trying to use DALI to speed up my workflow in processing medical imaging. I wrote a custom Iterator for loading NRRD files and a very simple Pipeline that just performs ops.Cast() and ops.Resize() and returns the pixel array.

Then, I tried to iterate through my data with:

pipe = NrrdPipeline(batch_size=4,
                    num_threads=12,
                    device_id=0)
pipe.build()
for n_batch, data in enumerate(nrrd_iterator):
    print("Batch {}".format(n_batch))

This is really slow, though, and looking at top, only one thread/core is being used. I'm loading data from an SSD, so it's not an I/O issue. I feel like I'm missing something very obvious here and my inexperience doesn't let me see it clearly 😄 Should I be doing my file loading and array processing fully in the Pipeline instead of inside the Iterator? I tried moving the file loading to the iter_setup() part of the Pipeline, but the problem still persists.
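For reference, the iterator and pipeline look roughly like this (a simplified sketch: the NRRD reading is replaced with a placeholder, and the file paths, shapes, and resize arguments are made up for illustration):

import numpy as np
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class NrrdIterator(object):
    """Yields batches of numpy arrays; the real version reads NRRD files."""
    def __init__(self, file_list, batch_size):
        self.files = file_list
        self.batch_size = batch_size
        self.i = 0

    def __iter__(self):
        self.i = 0
        return self

    def __next__(self):
        batch = []
        for _ in range(self.batch_size):
            # placeholder for nrrd.read(self.files[self.i]) + numpy preprocessing
            batch.append((np.random.rand(512, 512, 3) * 255).astype(np.uint8))
            self.i = (self.i + 1) % len(self.files)
        return batch

file_list = ["scan_000.nrrd", "scan_001.nrrd"]  # placeholder paths
nrrd_iterator = iter(NrrdIterator(file_list, batch_size=4))

class NrrdPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(NrrdPipeline, self).__init__(batch_size, num_threads, device_id)
        self.source = ops.ExternalSource()
        self.resize = ops.Resize(resize_x=256, resize_y=256)
        self.cast = ops.Cast(dtype=types.FLOAT)

    def define_graph(self):
        self.pixels = self.source()
        return self.cast(self.resize(self.pixels))

    def iter_setup(self):
        # feed the next externally prepared batch into the graph
        self.feed_input(self.pixels, next(nrrd_iterator))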

Thanks for your help!

JanuszL commented 4 years ago

Hi, could you share the pipeline you are using? As I understand it, it consists of an ExternalSource to load the data and ops.Cast() and ops.Resize() to process it. Do these operators run on the GPU? If so, you won't see high CPU utilization from them. My educated guess is that the root cause is the ExternalSource, which runs synchronously with the main Python thread and is not accelerated by multithreading. We have some ideas about splitting this particular operator across multiple workers, but that is a long shot, so it is hard to say when we will get to it. So this could be the bottleneck in your case, but it is hard to tell more without seeing a code snippet. You may try to put your data into one of the supported containers, TFRecord for example: its reading is split across multiple threads and would be faster than the ExternalSource.
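For instance, a TFRecord pipeline would look roughly like this (a sketch only: the paths and feature names are placeholders, the index file comes from the tfrecord2idx script shipped with DALI, and the decoder step assumes encoded 2D images rather than raw volumes):

import nvidia.dali.ops as ops
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec
from nvidia.dali.pipeline import Pipeline

class TFRecordPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(TFRecordPipeline, self).__init__(batch_size, num_threads, device_id)
        # the reader splits its work across the pipeline's native threads
        self.input = ops.TFRecordReader(
            path="data/train.tfrecord",   # placeholder data file
            index_path="data/train.idx",  # placeholder index built with tfrecord2idx
            features={
                "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
                "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
            })
        self.decode = ops.ImageDecoder(device="cpu", output_type=types.RGB)
        self.resize = ops.Resize(resize_x=256, resize_y=256)
        self.cast = ops.Cast(dtype=types.FLOAT)

    def define_graph(self):
        inputs = self.input()
        images = self.decode(inputs["image/encoded"])
        return self.cast(self.resize(images)), inputs["image/class/label"]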

We try to help however we can :-)

mzient commented 4 years ago

Hello, I see you specify num_threads=12, but the CPU operators typically distribute the individual samples of a batch across threads, so with batch_size=4 you won't get any benefit from setting num_threads greater than four. Also, if the samples vary greatly in size, one of them will of course take longer to process.

Happy to help again in the future!

liangxiao05 commented 4 years ago

I ran into this problem too. I wrote my own ExternalInputIterator; inside it I do a lot of image preprocessing on the CPU that is not available in DALI. The ExternalInputIterator returns a batch of data (batch_size is 50 per GPU), which is then fed into an ExternalSourcePipeline via ops.ExternalSource(). Everything runs correctly, with one problem: if I include some heavy operations in the CPU preprocessing, the whole schedule becomes very slow per batch, even 3x slower than the PyTorch DataLoader. After carefully debugging and looking into the code, I found that the PyTorch DataLoader uses more CPU cores and multiple threads to preprocess the images, whereas the DALI pipeline uses fewer cores, and setting num_threads to any value has no effect on the speed. Here are my test results (gpu_num: 2, batch_size: 50 per GPU), showing the time to prepare one batch with the DALI pipeline and the PyTorch DataLoader:

train_dali_numthread_1 (seconds per batch):
2.60746693611145, 2.5747182369232178, 2.6421051025390625, 2.696887969970703, 2.7098922729492188, 2.503169536590576, 2.5508782863616943, 2.775343894958496

train_pytorchdataloader_numthread_1 (seconds per batch):
4.18388032913208, 3.953681468963623, 4.04970645904541, 3.969261646270752, 4.076496839523315, 3.865492343902588, 4.394745349884033, 3.989100694656372, 3.94582462310791

train_dali_numthread_8 (seconds per batch):
2.6638147830963135, 2.8533313274383545, 2.6639766693115234, 2.612551212310791, 2.629957914352417, 2.783665418624878, 2.6219711303710938, 2.6218550205230713

train_pytorchdataloader_numthread_8 (seconds per batch):
7.28627610206604, 0.00013518333435058594, 0.0003066062927246094, 0.00021123886108398438, 0.0002377033233642578, 0.0001881122589111328, 0.00017261505126953125, 0.00018334388732910156, 6.7942399978637695, 8.559226989746094e-05, 0.00018477439880371094, 0.00018334388732910156, 0.0001926422119140625, 0.00019240379333496094, 0.00018525123596191406, 0.00016808509826660156, 5.490636587142944, 0.0003657341003417969

JanuszL commented 4 years ago

Hi,

@liangxiao05 that is expected. PyTorch has sophisticated functionality that spreads the work across multiple dataloader worker processes, because Python multithreading performs poorly (due to the GIL). In DALI, we rely on threading on the native side, not the Python one, and the ExternalSource operator is not designed to be multithreaded. Any processing at the Python level in DALI is mostly for prototyping and debugging, and we do not promise good performance there.
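For comparison, the PyTorch speedup comes from worker processes running the Python preprocessing in parallel, roughly like this (ToyDataset below is just a stand-in for a dataset whose __getitem__ does heavy CPU work):

import numpy as np
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for a dataset whose __getitem__ does heavy CPU preprocessing."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # imagine expensive numpy/scipy preprocessing here
        return np.random.rand(224, 224, 3).astype(np.float32)

# each of the 8 workers is a separate process interpreting the Python code
# above, so heavy CPU preprocessing scales with num_workers despite the GIL
loader = DataLoader(ToyDataset(), batch_size=50, num_workers=8, pin_memory=True)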

liangxiao05 commented 4 years ago

Thanks for the explanation. I will try some other ways.

JanuszL commented 4 years ago

@liangxiao05 - you can always create a custom operator that would do your processing inside the DALI pipeline. You can find a tutorial here.

liangxiao05 commented 4 years ago

That's much harder for me because I haven't used C++ in a long time, LOL. Anyway, thanks very much.

JanuszL commented 3 years ago

Hi @liangxiao05,

The most recent improvements to the external_source operator make it possible to scale it across multiple CPU workers. Please check the parallel and prefetch_queue_depth parameters.
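With the current API it looks roughly like this (a minimal sketch: load_sample below stands in for the per-sample loading and preprocessing, and the shapes and worker counts are arbitrary):

import numpy as np
from nvidia.dali import fn, pipeline_def

def load_sample(sample_info):
    # placeholder for reading and preprocessing one sample;
    # sample_info.idx_in_epoch tells which sample to produce
    return np.random.rand(512, 512, 3).astype(np.float32)

@pipeline_def(batch_size=50, num_threads=4, device_id=0,
              py_num_workers=8,           # worker processes for parallel external_source
              py_start_method="spawn",
              prefetch_queue_depth=2)
def parallel_pipe():
    data = fn.external_source(source=load_sample, parallel=True, batch=False)
    return fn.resize(data, resize_x=256, resize_y=256)

if __name__ == "__main__":
    pipe = parallel_pipe()
    pipe.build()
    batch, = pipe.run()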