etienne87 opened this issue 4 years ago
Hi, DALI doesn't utilize multiple processes but multiple threads at the native level. The multi-process approach PyTorch has taken is a workaround for Python's inability to perform true multithreaded processing. For DALI this is not an issue, as all the processing happens on the native side without the need to hold the Python GIL. To answer your question: when you use built-in DALI operators, there should not be much performance difference beyond the fact that some processing is done at the Python level in the case of the PyTorch data loader, while DALI does everything on the native side and has no need for interprocess communication via shared memory. The difference will show up when you ask DALI to run a Python operator - in that case the Python GIL limitation comes into play and DALI cannot utilize parallel processing.
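The GIL limitation described above can be demonstrated with a plain-Python sketch (no DALI or PyTorch involved): CPU-bound work split across two Python threads takes roughly as long as running it serially, because only one thread can hold the GIL at a time. This is the constraint that pushes PyTorch toward worker processes, while DALI's native threads sidestep it by not needing the GIL at all.

```python
import threading
import time

def cpu_work(n):
    # CPU-bound loop; a pure-Python thread holds the GIL while running this
    s = 0
    for i in range(n):
        s += i * i
    return s

N = 2_000_000

# Run the work twice serially
t0 = time.perf_counter()
cpu_work(N)
cpu_work(N)
serial = time.perf_counter() - t0

# Run the same work in two threads
t0 = time.perf_counter()
threads = [threading.Thread(target=cpu_work, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# With the GIL, the threaded version is typically no faster than serial
print(f"serial: {serial:.2f}s, 2 threads: {threaded:.2f}s")
```

Native code (OpenCV's C++ kernels, DALI operators) releases the GIL while it works, which is why native threads do scale where Python threads don't.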
Thanks for answering so fast! Just to be sure, you would say then that DALI is worth it because preprocessing avoids the latency of Python calls (to OpenCV functions, for instance) and because sharing memory between threads in C++ is faster than multiprocessing, right? I am just looking for a rough estimate of the relative runtime difference (say, 1.5x?) when streaming those video clips in a temporally coherent manner across batches (it probably grows with video-clip resolution and length).
What I wanted to point out were the things that should be considered, but I cannot tell how much each of them contributes to the overall performance difference (and I don't know the PyTorch DataLoader implementation that well). The fixed overhead will contribute less to the total time as the data gets bigger - the amount of processing done by each invocation grows (the OpenCV work), while the invocation cost stays more or less constant. You also need to consider that each operation implemented in OpenCV and DALI may differ in performance as well. We plan to implement the Resize operator for video soon, so you will be able to run a side-by-side perf comparison on your own.
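The amortization argument above can be made concrete with a toy model (the numbers below are illustrative assumptions, not measurements from either library): if each loader invocation pays a fixed per-call overhead plus work proportional to the clip size, the overhead's share of the total shrinks as clips get longer.

```python
def overhead_fraction(fixed_overhead_ms, work_ms_per_frame, frames):
    """Fraction of one invocation spent on fixed per-call overhead."""
    total = fixed_overhead_ms + work_ms_per_frame * frames
    return fixed_overhead_ms / total

# Hypothetical numbers: 0.5 ms call overhead, 0.2 ms of decode/resize per frame
for frames in (8, 64, 512):
    print(frames, round(overhead_fraction(0.5, 0.2, frames), 3))
# 8 frames  -> ~24% overhead
# 512 frames -> <1% overhead
```

So for short clips the Python-call and IPC overheads can dominate, while for long, high-resolution clips the per-operator throughput (OpenCV vs. DALI kernels) matters far more than the invocation machinery.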
Sorry if it is not an issue per se. Would it be interesting to benchmark nvidia-dali against the iterable-style PyTorch DataLoader, which also allows stateful streaming across multiple processes (or even custom multiprocessing in Python + numpy)? Do we expect much better performance on the nvidia-dali side?
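For reference, the "custom multiprocessing in Python + numpy" baseline mentioned above could be sketched like this (all names, shapes, and the random "decoded frames" are made up for illustration): each worker process owns one stream, carries its frame offset as state, and pushes temporally ordered chunks into a shared queue.

```python
import multiprocessing as mp
import numpy as np

def stream_worker(worker_id, n_chunks, clip_shape, out_queue):
    # Stateful streaming: this worker owns one "video stream" and emits
    # its chunks in temporal order (frame_offset is the carried state).
    frame_offset = 0
    for _ in range(n_chunks):
        clip = np.random.rand(*clip_shape).astype(np.float32)  # stand-in for decoded frames
        out_queue.put((worker_id, frame_offset, clip))
        frame_offset += clip_shape[0]
    out_queue.put((worker_id, None, None))  # end-of-stream marker

def stream_batches(n_workers=2, n_chunks=3, clip_shape=(4, 32, 32, 3)):
    queue = mp.Queue()
    workers = [mp.Process(target=stream_worker, args=(i, n_chunks, clip_shape, queue))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    finished = 0
    while finished < n_workers:
        worker_id, offset, clip = queue.get()
        if offset is None:
            finished += 1
        else:
            yield worker_id, offset, clip
    for w in workers:
        w.join()

if __name__ == "__main__":
    chunks = list(stream_batches())
    print(len(chunks), chunks[0][2].shape)
```

Note the cost this sketch pays on every chunk: each numpy array is pickled and copied through the queue to the consumer process. That serialization and copying is exactly the interprocess traffic that DALI's single-process, native-thread design avoids.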
opencv_streamer.py
stream_loader.py