NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Differences between native PyTorch transforms.Resize() and DALI's ops.Resize()? #422

Closed paulbisso closed 5 years ago

paulbisso commented 5 years ago

Hi -- I've been able to confirm that using the DALI dataloader gives me a 3-5X epoch time speed-up over the PyTorch native dataloader on equivalent hardware running the same neural net training routine. Thanks a lot, this is a great resource!

However, I have one question (here) and one remaining issue (I'll address separately).

Question: what is the difference between the PIL.Image.BILINEAR interpolation used to resize images in PyTorch and DALI's ops.Resize() with its default interpolation type (linear)?

When I run identical images through identical DALI and PyTorch processing pipelines that do not involve image resizing, even when they include cropping/normalization/etc., I get nearly identical images from both pipelines. Nice!

However, when I use ops.Resize() and PyTorch's torchvision.transforms.Resize() (which utilizes PIL.Image.BILINEAR interpolation), I see large (>10 units out of 255) pixel-level differences in my images.
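For reference, a minimal sketch of this kind of side-by-side comparison (not the exact code: the file name and target size are illustrative, and the DALI side is written with the newer fn API, which wraps the same Resize operator as ops.Resize):

import numpy as np
from PIL import Image
from torchvision import transforms
from nvidia.dali import pipeline_def, fn, types

# DALI side: a minimal CPU-only pipeline around the Resize operator
@pipeline_def(batch_size=1, num_threads=1, device_id=None)
def resize_pipe(files):
    jpegs, _ = fn.readers.file(files=files)
    images = fn.decoders.image(jpegs, device="cpu", output_type=types.RGB)
    return fn.resize(images, resize_x=224, resize_y=224,
                     interp_type=types.INTERP_LINEAR)

pipe = resize_pipe(files=["sample.jpg"])  # illustrative file name
pipe.build()
(out,) = pipe.run()
dali_img = np.array(out[0])

# PyTorch side: transforms.Resize on a PIL image uses PIL.Image.BILINEAR
ref = np.array(transforms.Resize((224, 224))(Image.open("sample.jpg")))

diff = np.abs(ref.astype(np.int32) - dali_img.astype(np.int32))
print(diff.max(), diff.mean())  # max per-pixel difference exceeds 10/255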

I'm not flagging this as an issue: although it means validation-set loss figures are not directly comparable between the PyTorch and DALI dataloaders on an identical image set with identical processing, I'm still able to train a network with DALI using validation loss as a guide to when training should finish.

But it's probably important for users to understand what is being done here and how it differs from what is being done by PIL.

Thank you!

Kh4L commented 5 years ago

Hi Paul,

Really happy to hear that DALI is clearly boosting your training :)

We are aware of this problem of Resize producing different results from the PIL implementation used in PyTorch. The Resize operator actually uses OpenCV for the CPU version and NPP for the GPU one. These yield different results from the PIL implementation; it does not appear to be a new problem: https://github.com/python-pillow/Pillow/issues/2718

https://github.com/python-pillow/Pillow/blob/55e5b7de6c41b0386660b0bee7784ac04f412f4b/src/libImaging/Resample.c
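As a quick illustration of the mismatch (a small sketch, assuming Pillow and opencv-python are installed; the sizes are arbitrary):

import numpy as np
import cv2
from PIL import Image

src = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)

# OpenCV bilinear, which the CPU Resize operator is based on
cv_out = cv2.resize(src, (32, 32), interpolation=cv2.INTER_LINEAR)

# PIL bilinear, which torchvision's transforms.Resize uses
pil_out = np.array(Image.fromarray(src).resize((32, 32), Image.BILINEAR))

# When downscaling, PIL's BILINEAR filter adapts its support (antialiasing),
# while INTER_LINEAR does not, so per-pixel differences are expected
print(np.abs(cv_out.astype(int) - pil_out.astype(int)).max())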

Kh4L commented 5 years ago

@awolant

awolant commented 5 years ago

Hi Paul, thanks for your comment, great to work with such an involved contributor :) We are focusing our efforts on resolving this issue, as it has also come up in other contexts. I will keep you posted on the progress here.

JanuszL commented 5 years ago

Tracked as DALI-480

JanuszL commented 5 years ago

https://github.com/NVIDIA/DALI/pull/435 should address this problem. The GPU variant of that fix is under development.

JanuszL commented 5 years ago

We improved the Resize function by employing resampling in https://github.com/NVIDIA/DALI/pull/520. In the case of RN50 the final accuracy is largely unaffected, but for some networks it could make a difference.

vriviere-odin commented 1 year ago

We are currently migrating our inference to Triton and are therefore running some reproducibility tests. However, just like @paulbisso, on one random image we saw a pixel difference of up to ±12 out of 255 between the PIL Image and DALI resize (both using linear interpolation). Any way to get closer results? Best

JanuszL commented 1 year ago

Hi @vriviere-odin,

Thank you for reaching out. Can you check whether setting antialias=True or changing the subpixel_scale value in DALI resize changes anything? Please also check this thread, especially https://github.com/NVIDIA/DALI/issues/4257#issuecomment-1249008249, for reference.
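For example, something along these lines (a sketch against a recent DALI release; whether the antialias argument is available depends on the DALI version you are running):

from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=1, device_id=None)
def resize_pipe(files):
    jpegs, _ = fn.readers.file(files=files)
    images = fn.decoders.image(jpegs, device="cpu", output_type=types.RGB)
    return fn.resize(images, resize_x=224, resize_y=224,
                     interp_type=types.INTERP_LINEAR,
                     antialias=True,        # filtered downscaling, closer to PIL
                     subpixel_scale=False)  # try toggling this as well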

stap-odin commented 1 year ago

I'm working with @vriviere-odin and made a very small notebook based on yours to show the difference during resizing only.

We're seeing a 0.15% difference from PIL's resizing on average, with a peak of 11.6%. We tested pretty much every combination of arguments to try to reproduce PIL's behaviour, including antialias and subpixel_scale.

JanuszL commented 1 year ago

Hi @stap-odin,

Thank you for providing a comprehensive example. It seems that the difference is on the borders; other pixels are mostly off by 1. Adding:

diff[(diff < 2)] = 0

reduces the average error by an order of magnitude, to 0.01%.

mzient commented 1 year ago

The difference in edge pixels stems from different border handling. DALI uses an equivalent of OpenCV's BORDER_REPLICATE; PIL rejects out-of-range pixels and renormalizes the kernel. The issue becomes relevant when sampling outside the image with kernels larger than bilinear (Cubic, Lanczos, Linear+antialias when downscaling). Example:

Source pixels (monochrome):

[1, 5, 3]

Kernel: [0.2, 0.5, 0.3], centered at the leftmost pixel. PIL:

(1 * 0.0 + 1 * 0.5 + 5 * 0.3) / (0.5 + 0.3) = 2.5
 ^^^^^^^--- rejected            ^^^^^^^^^^^--- renormalized

DALI:

1 * 0.2 + 1 * 0.5 + 5 * 0.3 = 2.2
^--- replicated
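
A small numeric sketch of the two border policies (just the arithmetic above, not library code):

import numpy as np

pixels = np.array([1.0, 5.0, 3.0])
kernel = np.array([0.2, 0.5, 0.3])  # centered at the leftmost pixel,
                                    # so the first tap falls at index -1
taps = np.arange(-1, 2)

# DALI / OpenCV BORDER_REPLICATE: clamp the out-of-range index to 0
dali_val = np.dot(kernel, pixels[np.clip(taps, 0, len(pixels) - 1)])        # 2.2

# PIL: drop the out-of-range tap and renormalize the remaining weights
valid = taps >= 0
pil_val = np.dot(kernel[valid], pixels[taps[valid]]) / kernel[valid].sum()  # 2.5

print(dali_val, pil_val)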

stap-odin commented 1 year ago

Thank you both @JanuszL and @mzient for your precise and quick responses. We would have missed that the differences are only on the borders. Best