A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Apache License 2.0

Observing difference in preprocessed output between dali and torch transform #4257

Closed shrinath-suresh closed 2 years ago

shrinath-suresh commented 2 years ago

While comparing the tensor output of torch transforms with that of DALI transforms, we observe a difference between the two.

Attached is a notebook with a fully reproducible example: Dali preprocessing repro.zip

Alternatively, follow the steps below.

Download the sample image (kitten.jpg) from here

Load the image as bytes. TorchServe uses a bytearray as input, so we want to keep it this way.

with open("kitten.jpg", "rb") as fp:
    img_data = fp.read()

Preprocess the image with torch transforms

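import io
from PIL import Image
from torchvision import transforms
torch_transform = transforms.Compose([
        transforms.Resize(256),      # assumed: mirrors the resize and crop of the DALI pipeline below
        transforms.CenterCrop(224),
        transforms.ToTensor(),       # assumed: Normalize needs a float tensor in [0, 1]
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

PILImage = Image.open(io.BytesIO(img_data))
torch_transformed_tensor = torch_transform(PILImage)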

Define Dali pipeline

import nvidia.dali as dali
from nvidia.dali import pipeline_def, types

@pipeline_def(batch_size=1, num_threads=1, device_id=0)
def dali_pipeline(batch_tensor):
    # feed the encoded JPEG bytes, decode, resize the shorter side to 256
    jpegs = dali.fn.external_source(source=[batch_tensor], dtype=types.UINT8)
    jpegs = dali.fn.decoders.image(jpegs, device="mixed")
    jpegs = dali.fn.resize(jpegs, size=[256], subpixel_scale=False, interp_type=types.DALIInterpType.INTERP_LINEAR, mode="not_smaller")
    normalized = dali.fn.crop_mirror_normalize(
        jpegs,  # image batch to crop and normalize
        crop=(224, 224),
        mean=[0.485 * 255,0.456 * 255,0.406 * 255],
        std=[0.229 * 255,0.224 * 255,0.225 * 255],
    )

    return normalized

Preprocess using dali pipeline

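import numpy as np
from nvidia.dali.plugin.pytorch import DALIGenericIterator as PyTorchIterator, LastBatchPolicy

# convert the image bytes to a 1-D uint8 numpy array (still JPEG-encoded)
batch_tensor = []
np_image = np.frombuffer(img_data, dtype=np.uint8)
batch_tensor.append(np_image)  # assumed step: the batch holds the single encoded image
result = []
datam = PyTorchIterator([dali_pipeline(batch_tensor)], ['data'], last_batch_policy=LastBatchPolicy.PARTIAL, last_batch_padded=True)
for i, data in enumerate(datam):
    # assumed loop body: collect the 'data' output of the single pipeline
    result.append(data[0]['data'])

dali_transformed_tensor = result[0].squeeze(0).detach().cpu()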

Tensor output from torch transform

tensor([[[-0.1657, -0.3712, -0.5938,  ..., -0.9192, -0.9192, -0.9192],
         [-0.0287, -0.2171, -0.4226,  ..., -0.8335, -0.8507, -0.8678],
         [ 0.0569, -0.1143, -0.2856,  ..., -0.7822, -0.7993, -0.7993],
         [ 1.7009,  1.4954,  1.6324,  ..., -0.6109, -0.5938, -0.4911],
         [ 1.7865,  1.7523,  1.7865,  ..., -0.6109, -0.5767, -0.4739],
         [ 1.7865,  1.7180,  1.7352,  ..., -0.5938, -0.5424, -0.4568]],

        [[ 0.0301, -0.1800, -0.4076,  ..., -0.5476, -0.5476, -0.5301],
         [ 0.2052, -0.0049, -0.2150,  ..., -0.4601, -0.4601, -0.4776],
         [ 0.3102,  0.1176, -0.0749,  ..., -0.4076, -0.3901, -0.4076],
         [ 1.7108,  1.5182,  1.6232,  ..., -0.2675, -0.2500, -0.1625],
         [ 1.8158,  1.7983,  1.7983,  ..., -0.2675, -0.2325, -0.1450],
         [ 1.8508,  1.7283,  1.7458,  ..., -0.2675, -0.2325, -0.1450]],

        [[ 0.1999, -0.1138, -0.3927,  ..., -1.0724, -1.0898, -1.0898],
         [ 0.4265,  0.1476, -0.1487,  ..., -1.0027, -1.0201, -1.0376],
         [ 0.5659,  0.3393,  0.0605,  ..., -0.9330, -0.9504, -0.9678],
         [ 1.8034,  1.5768,  1.6640,  ..., -0.7936, -0.7587, -0.6018],
         [ 1.9080,  1.8383,  1.8383,  ..., -0.7936, -0.7413, -0.5844],
         [ 1.9603,  1.7860,  1.8208,  ..., -0.8110, -0.7587, -0.6018]]])

Tensor output from dali transform

tensor([[[-0.1657, -0.3883, -0.5938,  ..., -0.9192, -0.9192, -0.9192],
         [-0.0287, -0.2342, -0.4226,  ..., -0.8335, -0.8507, -0.8678],
         [ 0.0569, -0.1143, -0.2856,  ..., -0.7822, -0.7993, -0.7993],
         [ 1.6667,  1.4954,  1.6324,  ..., -0.6281, -0.6109, -0.5082],
         [ 1.7523,  1.7352,  1.7694,  ..., -0.6109, -0.5767, -0.4739],
         [ 1.7865,  1.7180,  1.7352,  ..., -0.5938, -0.5596, -0.4568]],

        [[ 0.0301, -0.1975, -0.4076,  ..., -0.5476, -0.5476, -0.5476],
         [ 0.1877, -0.0224, -0.2150,  ..., -0.4776, -0.4776, -0.4951],
         [ 0.3102,  0.1176, -0.0749,  ..., -0.4076, -0.4076, -0.4251],
         [ 1.7108,  1.5182,  1.6232,  ..., -0.2850, -0.2675, -0.1800],
         [ 1.7983,  1.7808,  1.7808,  ..., -0.2675, -0.2500, -0.1625],
         [ 1.8333,  1.7283,  1.7283,  ..., -0.2675, -0.2500, -0.1450]],

        [[ 0.1999, -0.1312, -0.4101,  ..., -1.0898, -1.0898, -1.1073],
         [ 0.4091,  0.1302, -0.1487,  ..., -1.0201, -1.0376, -1.0550],
         [ 0.5485,  0.3219,  0.0605,  ..., -0.9330, -0.9504, -0.9678],
         [ 1.8034,  1.5768,  1.6465,  ..., -0.7936, -0.7587, -0.6193],
         [ 1.9080,  1.8383,  1.8383,  ..., -0.7936, -0.7413, -0.6018],
         [ 1.9428,  1.7860,  1.8034,  ..., -0.8284, -0.7761, -0.6193]]])

The two tensors are not equal, as the code below returns False

torch.allclose(torch_transformed_tensor, dali_transformed_tensor)

The tensors match only within an absolute tolerance of 0.07, as the code below returns True

torch.allclose(torch_transformed_tensor, dali_transformed_tensor, atol=0.07)
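
One way to read these deltas, as a rough sketch: if the image is first scaled to [0, 1] and then divided by the per-channel std (equivalently, divided by std * 255 as in the DALI pipeline), a change of one 8-bit intensity level shifts the normalized value by 1 / (255 * std). The helper name `delta_to_levels` is hypothetical, and the sample values are copied from row 0, column 1 of each channel in the printed tensors above.

```python
# Per-channel std used by both pipelines (from the Normalize call above).
STD = [0.229, 0.224, 0.225]


def delta_to_levels(delta, channel):
    """Convert a difference in normalized units to 8-bit intensity levels."""
    return abs(delta) * 255 * STD[channel]


# Values copied from the printed tensors: row 0, column 1 of each channel.
torch_vals = [-0.3712, -0.1800, -0.1138]
dali_vals = [-0.3883, -0.1975, -0.1312]

for ch, (t, d) in enumerate(zip(torch_vals, dali_vals)):
    print("channel", ch, "differs by", round(delta_to_levels(t - d, ch), 2), "levels")

# The atol=0.07 bound expressed in intensity levels of the first channel.
print("atol=0.07 is about", round(delta_to_levels(0.07, 0), 2), "levels")
```

For these sampled entries the difference is about one intensity level per channel, and the 0.07 tolerance corresponds to roughly four levels, which suggests small rounding or interpolation differences rather than a wrong mean or std.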

Passing the preprocessed output to the pretrained ResNet model predicts the same class with both torch and DALI. However, the probabilities differ.

Output using torch tensor

tabby 0.4096631407737732
tiger_cat 0.3467048108577728
Egyptian_cat 0.13002879917621613
lynx 0.023919595405459404
bucket 0.011532180942595005

Output using dali tensor

tabby 0.4087514281272888
tiger_cat 0.3540496230125427
Egyptian_cat 0.12418904155492783
lynx 0.025347236543893814
bucket 0.011393276043236256
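
As a small sketch of how the two top-5 lists compare, the snippet below checks whether the class ranking is identical and which class shifts the most. The dictionaries simply copy the probabilities printed above; the variable names are illustrative.

```python
# Top-5 probabilities copied from the two outputs above (insertion order = rank).
torch_top5 = {
    "tabby": 0.4096631407737732,
    "tiger_cat": 0.3467048108577728,
    "Egyptian_cat": 0.13002879917621613,
    "lynx": 0.023919595405459404,
    "bucket": 0.011532180942595005,
}
dali_top5 = {
    "tabby": 0.4087514281272888,
    "tiger_cat": 0.3540496230125427,
    "Egyptian_cat": 0.12418904155492783,
    "lynx": 0.025347236543893814,
    "bucket": 0.011393276043236256,
}

# Same classes in the same order?
same_order = list(torch_top5) == list(dali_top5)

# Absolute probability shift per class, and the class that moves the most.
diffs = {name: abs(torch_top5[name] - dali_top5[name]) for name in torch_top5}
worst = max(diffs, key=diffs.get)

print("same ranking:", same_order)
print("largest shift:", worst, round(diffs[worst], 4))
```

The ranking is unchanged and the largest shift stays below 0.01.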

We would like to know whether this behaviour is expected, or whether there is a way to make the preprocessed tensor output identical between torch transforms and DALI.

Note: I have already gone through the existing closed issue https://github.com/NVIDIA/DALI/issues/3610 and updated the pipeline accordingly.

Tested with PyTorch 1.12.1, CUDA 11.3, and DALI 1.16.1.

JanuszL commented 2 years ago

Hi @shrinath-suresh,

To match torchvision, you need the former INTERP_TRIANGULAR interpolation type, which is now expressed as INTERP_LINEAR with antialias=True, because torchvision enables antialiasing by default when using linear interpolation.

On top of that, another source of discrepancy is JPEG decoding. There is no single JPEG decoding standard: in general, the higher the PSNR of an encode-decode round trip, the better, so different decoders employ different tricks to improve the result, sometimes optimized for images from a specific field. nvJPEG (which DALI uses under the hood for GPU-accelerated decoding) converts from YUV to RGB with a different interpolation strategy than libjpeg-turbo, which is used for CPU decoding in both torchvision and DALI. So the following pipeline should yield closer results:

@pipeline_def(batch_size=1, num_threads=1, device_id=0)
def dali_pipeline(batch_tensor):
    jpegs = dali.fn.external_source(source=[batch_tensor], dtype=types.UINT8)
    # decode on the CPU so that the decoder matches the one torchvision uses
    jpegs = dali.fn.decoders.image(jpegs, device="cpu")
    # linear interpolation with antialiasing (the former INTERP_TRIANGULAR)
    jpegs = dali.fn.resize(jpegs, size=[256], subpixel_scale=False, interp_type=types.DALIInterpType.INTERP_LINEAR, antialias=True, mode="not_smaller")
    normalized = dali.fn.crop_mirror_normalize(
        jpegs,  # image batch to crop and normalize
        crop=(224, 224),
        mean=[0.485 * 255,0.456 * 255,0.406 * 255],
        std=[0.229 * 255,0.224 * 255,0.225 * 255],
    )

    return normalized

You can check this standalone example for reference: Resize_example.zip.
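
To illustrate why the antialias setting changes the resized values, here is a simplified 1-D sketch in plain Python. It is not the DALI or torchvision implementation: the function name `resize_linear_1d` and the test signal are made up for illustration. It only shows the general idea that an antialiased linear resize widens the triangle filter by the downscale factor, while a plain linear resize looks at the two nearest input samples only.

```python
import math


def resize_linear_1d(src, out_len, antialias):
    """Resample a 1-D signal with a triangle (linear) filter.

    With antialias=True the filter support grows with the downscale factor,
    so every input sample contributes to some output sample.
    """
    in_len = len(src)
    scale = in_len / out_len
    support = max(scale, 1.0) if antialias else 1.0
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale  # output sample centre in input coordinates
        lo = max(0, math.floor(center - support))
        hi = min(in_len, math.ceil(center + support))
        # Triangle weights measured from each input sample centre (j + 0.5).
        weights = [max(0.0, 1.0 - abs(j + 0.5 - center) / support) for j in range(lo, hi)]
        total = sum(weights)
        out.append(sum(w * src[j] for w, j in zip(weights, range(lo, hi))) / total)
    return out


# A single bright sample in an otherwise dark row, downscaled 4x.
row = [0, 0, 0, 255, 0, 0, 0, 0]
plain = resize_linear_1d(row, 2, antialias=False)
smooth = resize_linear_1d(row, 2, antialias=True)
print("plain linear:", plain)
print("antialiased :", [round(v, 2) for v in smooth])
```

In this toy case the plain linear resize skips the bright sample entirely, while the antialiased version spreads it over both outputs, so the two modes cannot be expected to agree on real images either.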

Still, it is more or less expected that if inference is run with a different data processing pipeline (one that is not bit-exact) than the one the network was trained with, the results will be slightly different.
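
As a rough illustration of the JPEG decoding point, the sketch below decodes one row of a chroma-subsampled image in two ways. Which strategy nvJPEG or libjpeg-turbo actually uses is not stated in this thread; the two upsampling functions here (sample replication and a 3/4 + 1/4 triangle filter) and the pixel values are assumptions chosen only to show that two reasonable decoders can produce different RGB values from the same data.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Full-range BT.601 (JFIF) YCbCr to 8-bit RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(min(255, max(0, round(v))) for v in (r, g, b))


def upsample_replicate(chroma):
    """Double the chroma resolution by repeating each sample."""
    return [c for c in chroma for _ in range(2)]


def upsample_triangle(chroma):
    """Double the chroma resolution with a 3/4 + 1/4 triangle filter."""
    out = []
    for i, c in enumerate(chroma):
        left = chroma[max(i - 1, 0)]
        right = chroma[min(i + 1, len(chroma) - 1)]
        out.append(0.75 * c + 0.25 * left)
        out.append(0.75 * c + 0.25 * right)
    return out


# Hypothetical row: 4 luma samples, 2 chroma samples (2x horizontal subsampling).
luma = [100, 100, 100, 100]
cb = [128, 128]
cr = [100, 160]

rgb_replicate = [ycbcr_to_rgb(y, b, r)
                 for y, b, r in zip(luma, upsample_replicate(cb), upsample_replicate(cr))]
rgb_triangle = [ycbcr_to_rgb(y, b, r)
                for y, b, r in zip(luma, upsample_triangle(cb), upsample_triangle(cr))]

for a, b in zip(rgb_replicate, rgb_triangle):
    print(a, b, "same" if a == b else "differs")
```

The outer pixels agree while the inner ones differ, which is the kind of small, localized mismatch that remains even when the resize and normalization settings are identical.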