Hi Paul,
Really happy to hear that DALI is clearly boosting your training :)
We are aware of this problem of Resize producing results different from the PIL implementation used in PyTorch. The Resize operator actually uses OpenCV for the CPU version and NPP for the GPU one. Those yield different results from the PIL implementation; it seems to not be a new problem: https://github.com/python-pillow/Pillow/issues/2718
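For anyone who wants to reproduce the discrepancy without DALI in the loop, here is a minimal sketch comparing PIL's bilinear resize with OpenCV's (the input file name is a placeholder):

```python
import cv2
import numpy as np
from PIL import Image

img = Image.open("sample.jpg").convert("RGB")  # placeholder input
size = (224, 224)

# PIL resamples with an antialiasing convolution kernel.
pil_out = np.asarray(img.resize(size, Image.BILINEAR))

# OpenCV's classic bilinear interpolation (no antialiasing on downscale).
cv_out = cv2.resize(np.asarray(img), size, interpolation=cv2.INTER_LINEAR)

diff = np.abs(pil_out.astype(np.int16) - cv_out.astype(np.int16))
print("max per-pixel difference:", int(diff.max()))
```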
@awolant
Hi Paul, thanks for your comment, great to work with such an involved contributor :) We are focusing our efforts on resolving this issue, as it has also emerged in other contexts. I will keep you posted on the progress here.
Tracked as DALI-480
https://github.com/NVIDIA/DALI/pull/435 should address this problem. The GPU variant of that fix is under development.
We improved the Resize function by employing resampling in https://github.com/NVIDIA/DALI/pull/520. For RN50 the final accuracy is rather unaffected, but for some networks it could make a difference.
We are currently migrating our inference to Triton and are therefore running some reproducibility tests. However, just like @paulbisso, on one random image we saw a +/- 12/255 pixel difference between PIL Image and DALI resize (both using linear interpolation). Is there any way to get closer results? Best
Hi @vriviere-odin,
Thank you for reaching out.
Can you check if setting antialias=True or changing the subpixel_scale value in DALI resize changes anything?
Please also check this thread, especially https://github.com/NVIDIA/DALI/issues/4257#issuecomment-1249008249 for reference.
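For reference, here is a minimal sketch exercising those two arguments with DALI's functional API (the file name is a placeholder; try toggling both values):

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=1, device_id=0)
def resize_pipe():
    jpegs, _ = fn.readers.file(files=["sample.jpg"])  # placeholder path
    images = fn.decoders.image(jpegs, device="cpu")
    return fn.resize(
        images,
        resize_x=224, resize_y=224,
        interp_type=types.INTERP_LINEAR,
        antialias=True,        # convolution-based resampling, closer to PIL
        subpixel_scale=False,  # try toggling; changes how the scale factor is computed
    )

pipe = resize_pipe()
pipe.build()
(resized,) = pipe.run()
```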
I'm working with @vriviere-odin and made a very small notebook based on yours to show the difference during resizing only.
We're seeing a 0.15% difference from PIL's resizing on average, with a peak at 11.6%. We tested pretty much every combination of arguments to try to reproduce PIL's behaviour, including antialias and subpixel_scale.
Hi @stap-odin,
Thank you for providing a comprehensive example. It seems that the difference is on the borders; the other pixels are mostly off by 1. Adding:
diff[diff < 2] = 0
reduces the average error by an order of magnitude, to 0.01%.
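For illustration, here is that masking step as a small sketch, with toy arrays standing in for the real PIL and DALI resize outputs:

```python
import numpy as np

# Toy stand-ins for the two uint8 resize outputs; in practice these would be
# the images produced by the PIL and DALI pipelines.
pil_out = np.array([[10, 11, 12], [13, 14, 15]], dtype=np.uint8)
dali_out = np.array([[22, 11, 13], [13, 15, 15]], dtype=np.uint8)

diff = np.abs(pil_out.astype(np.int16) - dali_out.astype(np.int16))
diff[diff < 2] = 0  # drop off-by-one rounding noise, keep border outliers
print("mean relative error: {:.4%}".format(float(diff.mean()) / 255))
```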
The difference in edge pixels stems from different border handling. DALI uses an equivalent of OpenCV's BORDER_REPLICATE; PIL rejects out-of-range pixels and renormalizes the kernel. The issue becomes relevant when sampling outside the image with kernels larger than bilinear (Cubic, Lanczos, Linear+antialias when downscaling). Example:
Source pixels (monochrome): [1, 5, 3]
Kernel: [0.2, 0.5, 0.3], centered at the leftmost pixel.
```
PIL:
(1 * 0.0 + 1 * 0.5 + 5 * 0.3) / (0.5 + 0.3) = 2.5
 ^^^^^^^--- rejected             ^^^^^^^^^--- renormalized

DALI:
1 * 0.2 + 1 * 0.5 + 5 * 0.3 = 2.2
^^^^^^^--- replicated
```
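To make the two border policies concrete, here is a small toy sketch (a single output pixel of a 1D convolution, not DALI's or PIL's actual code) that reproduces the numbers above:

```python
import numpy as np

pixels = np.array([1.0, 5.0, 3.0])  # source row
kernel = [0.2, 0.5, 0.3]            # taps at offsets -1, 0, +1
OFFSETS = (-1, 0, 1)

def pil_style(px, k, center):
    # Reject taps that fall outside the image, renormalize the rest.
    acc = wsum = 0.0
    for off, w in zip(OFFSETS, k):
        i = center + off
        if 0 <= i < len(px):
            acc += px[i] * w
            wsum += w
    return acc / wsum

def dali_style(px, k, center):
    # Clamp out-of-range taps to the border pixel (BORDER_REPLICATE).
    acc = 0.0
    for off, w in zip(OFFSETS, k):
        i = min(max(center + off, 0), len(px) - 1)
        acc += px[i] * w
    return acc

print(pil_style(pixels, kernel, 0))   # 2.5 (modulo float rounding)
print(dali_style(pixels, kernel, 0))  # 2.2 (modulo float rounding)
```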
Thank you both @JanuszL and @mzient for your precise and quick responses. We would have missed that the differences are only on the borders. Best
Hi -- I've been able to confirm that using the DALI dataloader gives me a 3-5X epoch time speed-up over the PyTorch native dataloader on equivalent hardware running the same neural net training routine. Thanks a lot, this is a great resource!
However, I have one question (here) and one remaining issue (I'll address separately).
Question: what is the difference between the PIL.Image.BILINEAR convolution used to resize images in PyTorch and the DALI ops.Resize() function with the default (linear) interpolation type?
When I run identical images through identical DALI and PyTorch processing pipelines that do not involve image resizing, even when they include cropping/normalization/etc., I get nearly identical images from both pipelines. Nice!
However, when I use ops.Resize() and PyTorch's torchvision.transforms.Resize() (which utilizes PIL.Image.BILINEAR interpolation), I see large (>10 units out of 255) pixel-level differences in my images.
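Roughly, the comparison looks like the sketch below (the file name is a placeholder, and DALI's functional API is used for brevity):

```python
import numpy as np
from PIL import Image
import torchvision.transforms as T
from nvidia.dali import pipeline_def, fn, types

PATH, SIZE = "sample.jpg", 224  # placeholder input

# PyTorch path: torchvision's Resize on a PIL image uses PIL.Image.BILINEAR.
pt_out = np.asarray(T.Resize((SIZE, SIZE))(Image.open(PATH).convert("RGB")))

# DALI path: default (linear) interpolation.
@pipeline_def(batch_size=1, num_threads=1, device_id=0)
def resize_pipe():
    jpegs, _ = fn.readers.file(files=[PATH])
    images = fn.decoders.image(jpegs, device="cpu")
    return fn.resize(images, resize_x=SIZE, resize_y=SIZE,
                     interp_type=types.INTERP_LINEAR)

pipe = resize_pipe()
pipe.build()
(out,) = pipe.run()
dali_out = out.as_array()[0]

diff = np.abs(pt_out.astype(np.int16) - dali_out.astype(np.int16))
print("pixels differing by more than 10/255:", int((diff > 10).sum()))
```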
I'm not flagging this as a bug: although it means validation-set loss figures aren't directly comparable between the PyTorch and DALI dataloaders on an identical image set with identical processing, I'm still able to train a network with DALI using validation loss as a guide to when training should finish.
But it's probably important for users to understand what is being done here and how it differs from what is being done by PIL.
Thank you!