[Open] vinayak-sharan opened this issue 1 year ago
Okay, I got some info regarding this and am making some changes to the code. As you said, the output difference between FP16 and FP32 is high. Such a large difference when converting with FP16 precision compared to FP32 suggests that the model is highly sensitive to numerical precision. So we simply have to debug the code and make some changes to the grid_sample calls.
Meanwhile, can you share any other details regarding this?
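As a rough illustration of the precision sensitivity being discussed, here is a sketch of how much error is introduced just by round-tripping values through FP16. The data is random, not the actual model tensors:

```python
import torch

# Quantify the rounding error from casting values to fp16 and back.
torch.manual_seed(0)
x = torch.randn(1000)                      # stand-in values, not the real features
err = (x - x.half().float()).abs().max()   # worst-case fp16 rounding error
print(err.item())
```

With an ~11-bit significand, FP16 carries a relative error around 5e-4 per value; whether that stays small or gets amplified depends on the operations the model performs afterwards.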
Loading untrusted .pt files is a security risk. Please share code which reproduces this issue but does not need .pt files. Maybe you could set a seed (torch.manual_seed) and then call torch.randn to get the tensor you need. Or just hard-code the tensor values if they aren't large.
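A minimal self-contained sketch along those lines, using assumed tensor shapes (the real shapes from the .pt files would be substituted in):

```python
import torch
import torch.nn.functional as F

# Reproduce with seeded random tensors instead of loading .pt files.
torch.manual_seed(0)
feat = torch.randn(1, 3, 8, 8)           # assumed feature-map shape (N, C, H, W)
grid = torch.rand(1, 4, 4, 2) * 2 - 1    # sampling grid in [-1, 1], shape (N, H_out, W_out, 2)

out = F.grid_sample(feat, grid, align_corners=False)
print(out.shape)  # torch.Size([1, 3, 4, 4])
```

Anyone can then run the same script with the same seed and get bit-identical inputs, with no files attached to the issue.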
I made some fixes to the code: https://github.com/Suraj209211/GitHUB-E1.git. The fixed code is in a folder in that repository. Let me check the info you provided against the code. I have decoded the .pt files for the grid and features, and they are fairly large files. Do you have any other solution?
If the tensors are too large to hard-code, you could try to reproduce this issue with randomly generated tensors. First, set a random seed (torch.manual_seed). Then call torch.randn to get the two tensors you need.
Gotcha. Let me fix the code according to your suggestion.
Toby, I have fixed the code, but I have encountered an error with the torch version I am using, 2.1.1, which is the most recent version.
I downgraded torch to run the code, but it is still showing me some kind of error. The link to the code file is here: https://github.com/Suraj209211/GitHUB-E1/blob/90fdac40204e61257d2d47c7f5b74ea30393e492/BugFIx/coreMLBUG.py#L6 @TobyRoseman, let me know where I went wrong.
@TobyRoseman, have you gone through the code?
I have an approach for this; I will be trying it out.
I have not gone through the code. Before we can proceed here, we need code to reproduce the grid_sample issue (code that doesn't involve loading .pt files).
What error are you getting now? If this error is unrelated to coremltools, I'm probably not going to be able to help you.
I am getting an error about the Torch version from Core ML.
So we first need code to produce the grid values that were in the .pt file. Then we will do the same for feature.pt.
@TobyRoseman, is this the correct variation I produced for the .pt files?
@TobyRoseman, if it's fine, then let's check the other variation. I need a green signal for the bug fix.
Hello @TobyRoseman, I hope you have seen the output I updated here.
@Suraj209211 - I don't understand your last few messages.
In order to make progress, we need steps to reproduce this issue which do not involve loading .pt files.
Okay, then I've got the issue to solve now: I have to jot down steps to reproduce that don't involve the .pt files.
I have figured out some steps using torch.onnx.
@TobyRoseman, in reference to your message, I have figured out how to reproduce the issue without '.pt' files.
Link: https://github.com/Suraj209211/GitHUB-E1/blob/main/BugFIx/test.ipynb
What approach should I follow next, in your view?
Hey @TobyRoseman, hope you are doing well. I haven't found any further comments from you regarding this problem.
Is there anything else left to do here, or is the precision for the above code correct?
Hi @TobyRoseman, I understand your concerns; therefore, I have updated the steps to reproduce the issue. My apologies for the delayed response.
I have converted the .pt files to .txt files with the help of numpy.
Since loading untrusted .pt files poses a security risk, I am sharing the .txt files instead. The file 'feat.txt' exceeds 75 MB and therefore couldn't be uploaded here. To avoid any suspicion of viruses, I am uploading them to GitHub. The links are provided below. Please download the files and save them in the appropriate directory.
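A sketch of that conversion round trip, using a small placeholder tensor rather than the actual feat/grid data (note that numpy.savetxt only handles 1-D and 2-D arrays, so higher-rank tensors would need to be reshaped and their original shape stored separately):

```python
import os
import tempfile

import numpy as np
import torch

torch.manual_seed(0)
feat = torch.randn(4, 5)                       # placeholder tensor, not the real features
path = os.path.join(tempfile.gettempdir(), "feat.txt")

np.savetxt(path, feat.numpy())                 # write as plain text (default '%.18e' format)
restored = torch.from_numpy(np.loadtxt(path)).float()

print(torch.allclose(feat, restored, atol=1e-6))  # True
```

The default '%.18e' format preserves full float32 precision, so the round trip is lossless; a text file is also inspectable, unlike a pickled .pt file.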
@Suraj209211 - the code in your notebook doesn't look right. You're not passing the PyTorch and Core ML model the same input. So of course the output will be different.
@vinayak-sharan - thanks for updating the original code and including the data as text files. I can run your code without error. Here is the output I get:
Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp16 : 5.699944813386537e-05
Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp32 : 5.699944813386537e-05
Relative change in the difference: 0.0
This seems well within the range of acceptable differences. Are you seeing significantly larger differences?
Thank you for correcting me, @TobyRoseman. Is this bug still open or is it solved?
@TobyRoseman, for the original code I am getting the output value below. I think I may have mistakenly shared the wrong code in the GitHub link that I am using for the test. Sorry for the inconvenience.
The Output:
(https://github.com/Suraj209211/Apple/blob/main/fix2.ipynb) This is the correct link to the code for which I am getting the output.
🐞Describing the bug
The CoreML model, when converted from a PyTorch model using grid sampling, shows a large deviation in output values compared to the original PyTorch model.
The output difference is notably high. When using FP16 precision for conversion instead of the default FP32, the relative change in the output difference is approximately 131.59, i.e. about 13159%. This points towards an issue in the conversion process, or an incompatibility between PyTorch's grid_sample implementation and Core ML's.
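For clarity, the "relative change" figure can be computed as follows. The FP16 difference below is an assumed value chosen only to illustrate the reported ratio, not a measured output:

```python
# Assumed illustrative values, not actual measurements.
diff_fp32 = 5.6999e-05   # |pytorch - coreml| with fp32 conversion
diff_fp16 = 7.5570e-03   # |pytorch - coreml| with fp16 conversion (assumed)

# Relative change of the fp16 difference with respect to the fp32 difference.
rel_change = (diff_fp16 - diff_fp32) / diff_fp32
print(f"{rel_change:.2f}")  # ~131.58, i.e. about 13159%
```

A relative change of 0.0, as in the output @TobyRoseman posted above, means the FP16 and FP32 conversions deviate from PyTorch by the same amount.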
Here is the attached screenshot of the output of the code below:
Code To Reproduce
System environment (please complete the following information):
Files required for the above code.
Note: Since loading untrusted .pt files poses a security risk, I am sharing the .txt files instead. The file 'feat.txt' exceeds 75 MB and therefore couldn't be uploaded here. To avoid any suspicion of viruses, I am uploading them to GitHub. The links are provided below. Please download the files and save them in the appropriate directory. @TobyRoseman