NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

DALI results are different from the PyTorch data loader for ASR models #3356

Closed by agemagician 2 years ago

agemagician commented 2 years ago

Hello,

We are currently testing DALI with NVIDIA NeMo, and we get different results when using DALI than when using the standard PyTorch data loader.

We have created three Colab examples to reproduce our issue:
https://colab.research.google.com/drive/1Rz42EeQVDHhTso3kWvBp1wQc6x1FxXgY?usp=sharing
https://colab.research.google.com/drive/1Bk7fUBnTuIvC7JvDPau0lsuBjCmeOYEx?usp=sharing
https://colab.research.google.com/drive/1eQe7EjLwjGsO7Dktaj2p9uvb91YYpBQb?usp=sharing

We have opened an issue on NVIDIA NeMo: https://github.com/NVIDIA/NeMo/issues/2853

However, since this problem is related to DALI, it would be great if you could give us and the NeMo team some hints about what could have gone wrong.

The code for using DALI can be found here: https://github.com/NVIDIA/NeMo/blob/d04c7e9b4ea7055e7fd9777c6e112b8a42c16130/nemo/collections/asr/data/audio_to_text_dali.py

It should produce the same results as the code here: https://github.com/NVIDIA/NeMo/blob/d04c7e9b4ea7055e7fd9777c6e112b8a42c16130/nemo/collections/asr/data/audio_to_text.py#L762
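For anyone who wants to quantify the mismatch, here is a minimal sketch of how two batches could be compared numerically. It is not taken from the Colab notebooks or from NeMo: `compare_batches` is a hypothetical helper, the tolerances are arbitrary, and the random arrays only stand in for feature batches that would come from the two loaders.

```python
import numpy as np


def compare_batches(reference, candidate, rtol=1e-5, atol=1e-6):
    """Summarize how far apart two feature batches are (hypothetical helper)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)

    # A shape mismatch usually points to padding or length handling,
    # so report it separately instead of raising a broadcasting error.
    if reference.shape != candidate.shape:
        return {"same_shape": False, "max_abs_diff": None, "close": False}

    if reference.size:
        max_abs_diff = float(np.max(np.abs(reference - candidate)))
    else:
        max_abs_diff = 0.0

    return {
        "same_shape": True,
        "max_abs_diff": max_abs_diff,
        "close": bool(np.allclose(reference, candidate, rtol=rtol, atol=atol)),
    }


# Stand-in data (assumption): in practice these would be the batches
# produced by the PyTorch data loader and by the DALI pipeline.
rng = np.random.default_rng(0)
pytorch_features = rng.standard_normal((2, 80, 100)).astype(np.float32)
dali_features = pytorch_features + 1e-3  # simulate a small systematic offset

report = compare_batches(pytorch_features, dali_features)
print(report)
```

Reporting the maximum absolute difference next to the pass/fail flag helps separate floating-point noise from a real preprocessing mismatch.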

Any hint will be highly appreciated.

Thanks.

JanuszL commented 2 years ago

Hi @agemagician,

Thank you for reporting the issue. Let us check it and get back to you soon.

jantonguirao commented 2 years ago

Thank you, @agemagician, for reporting the issue and for the easy-to-follow reproduction steps! I can confirm this is a bug on the NeMo side. Please see https://github.com/NVIDIA/NeMo/issues/2853 for details.

agemagician commented 2 years ago

Thanks a lot, @jantonguirao, for your help and your effort.

I have updated the NeMo issue because there is still a slight difference between the results of the DALI and PyTorch data loaders.

Thanks again.