apple / coremltools

Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
https://coremltools.readme.io
BSD 3-Clause "New" or "Revised" License

Significant Output Discrepancy in CoreML Conversion of PyTorch's Grid Sample- FP16 #2068

Open vinayak-sharan opened 1 year ago

vinayak-sharan commented 1 year ago

🐞Describing the bug

The CoreML model, when converted from a PyTorch model using grid sampling, shows a large deviation in output values compared to the original PyTorch model.

The output difference is notably high. When using FP16 precision for conversion instead of the default FP32, the difference grows by a relative factor of approximately 131.59 (13,159%). This points towards an issue in the conversion process, or an incompatibility between PyTorch's grid sample implementation and CoreML's.

Attached is a screenshot of the output of the code below:

Screenshot 2023-11-22 at 12 48 05

Code To Reproduce

import torch
import coremltools as ct
import numpy as np

# Define a simple PyTorch model that uses grid sampling
class PytorchGridSample(torch.nn.Module):
    def forward(self, input, grid):
        return torch.nn.functional.grid_sample(input, grid, align_corners=False)

def convert_to_coreml(model, inputs, is_float16=True):
    traced_model = torch.jit.trace(
        model, example_inputs=inputs, strict=False)

    coreml_model = ct.converters.convert(traced_model,
                                         inputs=[ct.TensorType(shape=inputs[0].shape),
                                                 ct.TensorType(shape=inputs[1].shape)],
                                         compute_precision=ct.precision.FLOAT16 if is_float16 else ct.precision.FLOAT32)
    return coreml_model

def compare_grid_samples_after_coreml_conversion(pt_model, inputs, is_float16):
    """
    Compare the grid sample output before and after conversion to coreML
    """
    pt_out = pt_model(*inputs)
    coreml_pt_model = convert_to_coreml(pt_model, inputs, is_float16)
    input_names_coreml_pt = [i for i in
                             coreml_pt_model.input_description]
    input_data = {name: val.detach().numpy() for name, val in zip(input_names_coreml_pt, inputs)}

    coreml_pt_out = torch.as_tensor(list(coreml_pt_model.predict(input_data).values())[0])
    diff_pt_coreml = torch.norm(coreml_pt_out - pt_out)
    return diff_pt_coreml

if __name__ == "__main__":
    feat_file = "feat.txt"
    grid_file = "grid.txt"

    input_tensor_shape = (1, 64, 288, 288)
    grid_shape = (1, 288, 288, 2)

    # Use the file-name variables defined above instead of repeating the literals
    input_tensor = torch.from_numpy(np.loadtxt(feat_file).reshape(input_tensor_shape)).to(torch.float32)
    grid = torch.from_numpy(np.loadtxt(grid_file).reshape(grid_shape)).to(torch.float32)

    inputs = [input_tensor, grid]
    pt_model = PytorchGridSample()

    diff_pt_coreml_fp16 = compare_grid_samples_after_coreml_conversion(pt_model, [*inputs], is_float16=True)
    diff_pt_coreml_fp32 = compare_grid_samples_after_coreml_conversion(pt_model, [*inputs], is_float16=False)

    print(
        f"Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp16 : {diff_pt_coreml_fp16}")

    print(
        f"Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp32 : {diff_pt_coreml_fp32}")

    print(f"Relative change in the difference: {(diff_pt_coreml_fp16 - diff_pt_coreml_fp32) / diff_pt_coreml_fp32}")

System environment (please complete the following information):

Files required for the above code.

Note: Since loading untrusted .pt files poses a security risk, I am sharing the .txt files instead. The file 'feat.txt' exceeds 75 MB and therefore couldn't be uploaded here. To avoid any suspicion of viruses, I am uploading them to GitHub. The links are provided below. Please download the files and save them in the appropriate directory. @TobyRoseman

Suraj209211 commented 1 year ago

Okay, I got some info regarding this and am making some changes to the code. As you said, the output difference is high when comparing FP16 and FP32. The fact that FP16 conversion diverges so much from FP32 suggests the model is highly sensitive to numerical precision. So we have to debug the code and make some changes to the grid sample handling.

Meanwhile, can you share any other details regarding this?

TobyRoseman commented 1 year ago

Loading untrusted .pt files is a security risk. Please share code which reproduces this issue but does not need .pt files. Maybe you could set a seed (torch.manual_seed) and then call torch.randn to get the tensors you need. Or just hard-code the tensor values if they aren't large.

Suraj209211 commented 1 year ago

I made some fixes to the code: https://github.com/Suraj209211/GitHUB-E1.git (the fixed code is in a folder in that repository). Let me check the info you provided for the code. I have decoded the .pt files for the grid and the features, and they are rather large files. Do you have any other solution for this?

> Loading untrusted .pt files is a security risk. Please share code which reproduces this issue but does not need .pt files. Maybe you could set a seed (torch.manual_seed) and then call torch.randn to get the tensors you need. Or just hard-code the tensor values if they aren't large.

TobyRoseman commented 1 year ago

If the Tensors are too large to hard code, you could try to reproduce this issue with randomly generated tensors. First, set a random seed (torch.manual_seed). Then call torch.randn to get the two tensors you need.
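The suggestion above can be sketched as follows. The shapes match those from the original report; clamping the grid to [-1, 1] via `torch.rand` is an assumption on my part, since `grid_sample` expects normalized sampling coordinates in that range:

```python
import torch

# Seed the RNG so the tensors are reproducible across runs and machines
torch.manual_seed(0)

# Shapes taken from the original report: an (N, C, H, W) feature map
# and an (N, H_out, W_out, 2) sampling grid
input_tensor = torch.randn(1, 64, 288, 288)

# grid_sample expects normalized coordinates in [-1, 1]; using
# rand * 2 - 1 (rather than randn) keeps every sample in range
grid = torch.rand(1, 288, 288, 2) * 2 - 1
```

These two tensors could then replace the `np.loadtxt` calls in the original repro script.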

Suraj209211 commented 1 year ago

Gotcha. Let me fix the code accordingly.

Suraj209211 commented 1 year ago

Toby, I have fixed the code, but I have encountered an error with the torch version I am using, 2.1.1, which is the most recent version.

Suraj209211 commented 1 year ago

I have downgraded the version of torch to run the code, but it is still showing me some kind of error. The link to the code file is here: https://github.com/Suraj209211/GitHUB-E1/blob/90fdac40204e61257d2d47c7f5b74ea30393e492/BugFIx/coreMLBUG.py#L6 @TobyRoseman let me know where I went wrong.

Suraj209211 commented 12 months ago

@TobyRoseman have you gone through the code?

I have an approach for the same and will be trying it out.

TobyRoseman commented 12 months ago

I have not gone through the code. Before we can proceed here, we need code to reproduce the grid_sample issue (that doesn't involve loading .pt files).

What error are you getting now? If this error is unrelated to coremltools, I'm probably not going to be able to help you.

Suraj209211 commented 12 months ago

I am getting an error related to the torch version when using Core ML.

Suraj209211 commented 12 months ago

So we first need code to produce the grid values from the .pt files. Then we will use feature.pt for the same.

Suraj209211 commented 11 months ago

@TobyRoseman is this the correct variation? This is what I achieved for the .pt files:

Screenshot 2023-12-09 at 14 33 45

Suraj209211 commented 11 months ago

@TobyRoseman if it's fine, then let's check the other variations. I need a green signal for the bug fix.

Suraj209211 commented 11 months ago

Hello @TobyRoseman, I hope you have seen the output I updated here.

TobyRoseman commented 11 months ago

@Suraj209211 - I don't understand your last few messages.

In order to make progress, we need steps to reproduce this issue which do not involve loading .pt files.

Suraj209211 commented 11 months ago

Okay, then I understand the issue to be solved now: I have to write down steps to reproduce that don't involve the .pt files.

Suraj209211 commented 11 months ago

> Okay, then I understand the issue to be solved now: I have to write down steps to reproduce that don't involve the .pt files.

I have figured out some steps using torch.onnx.

Suraj209211 commented 11 months ago

@TobyRoseman with reference to your message, I have figured out how to reproduce the issue without '.pt' files.
Link: https://github.com/Suraj209211/GitHUB-E1/blob/main/BugFIx/test.ipynb

> In order to make progress, we need steps to reproduce this issue which do not involve loading .pt files.

Suraj209211 commented 11 months ago

What approach should I follow next, according to you?

Suraj209211 commented 10 months ago

Hey @TobyRoseman, I hope you are doing well. I have not seen any further comments from you regarding this problem.

Is there anything else left to do here, or is the precision for the above code correct?

vinayak-sharan commented 10 months ago

Hi @TobyRoseman, I understand your concerns, so I have updated the steps to reproduce the issue. My apologies for the delayed response.

I have converted the .pt files to .txt files with the help of NumPy.

Since loading untrusted .pt files poses a security risk, I am sharing the .txt files instead. The file 'feat.txt' exceeds 75 MB and therefore couldn't be uploaded here. To avoid any suspicion of viruses, I am uploading them to GitHub. The links are provided below. Please download the files and save them in the appropriate directory.
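The .pt-to-.txt conversion described above can be sketched with NumPy alone. Note that `np.savetxt` only accepts 1-D or 2-D arrays, so the tensor has to be flattened before writing and its shape restored on load. The small shape below is purely illustrative; the actual files use (1, 64, 288, 288) and (1, 288, 288, 2):

```python
import numpy as np

# Stand-in for the tensor that would come from feat.pt (hypothetical data);
# the real file uses shape (1, 64, 288, 288)
shape = (1, 4, 8, 8)
feat = np.random.randn(*shape).astype(np.float32)

# np.savetxt handles only 1-D/2-D arrays, so flatten before writing
np.savetxt("feat_demo.txt", feat.reshape(-1))

# Round-trip: reload, then restore the original shape and dtype
restored = np.loadtxt("feat_demo.txt").reshape(shape).astype(np.float32)
```

The default `savetxt` format (`%.18e`) carries enough digits that float32 values survive the round trip essentially unchanged.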

TobyRoseman commented 10 months ago

@Suraj209211 - the code in your notebook doesn't look right. You're not passing the PyTorch and Core ML models the same input. So of course the output will be different.

@vinayak-sharan - thanks for updating the original code and including the data as text files. I can run your code without error. Here is the output I get:

Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp16 : 5.699944813386537e-05
Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp32 : 5.699944813386537e-05
Relative change in the difference: 0.0

This seems well within the range of acceptable differences. Are you seeing significantly larger differences?
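For reference, the absolute norm reported above is easier to interpret when scaled by the norm of the reference output. A minimal helper for that (not part of the original repro script) might look like:

```python
import numpy as np

def relative_l2_error(reference, candidate):
    """L2 norm of the difference, scaled by the L2 norm of the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    return np.linalg.norm(candidate - reference) / np.linalg.norm(reference)

# Example: reference norm is 5.0, difference norm is 0.5
print(relative_l2_error([3.0, 4.0], [3.0, 4.5]))  # → 0.1
```

A metric like this makes "acceptable difference" comparable across outputs of very different magnitudes.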

Suraj209211 commented 10 months ago

> @Suraj209211 - the code in your notebook doesn't look right. You're not passing the PyTorch and Core ML models the same input. So of course the output will be different.
>
> @vinayak-sharan - thanks for updating the original code and including the data as text files. I can run your code without error. Here is the output I get:
>
> Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp16 : 5.699944813386537e-05
> Difference between pytorch's grid sample before and after conversion: Note: Pytorch is fp32 and coreML is fp32 : 5.699944813386537e-05
> Relative change in the difference: 0.0
>
> This seems well within the range of acceptable differences. Are you seeing significantly larger differences?

Thank you for correcting me, @TobyRoseman. Is this bug still open, or is it solved?

Suraj209211 commented 10 months ago

@TobyRoseman for the original code I am getting the output value below. I think I may have mistakenly shared the wrong code in the GitHub link that I am using for the test. Sorry for the inconvenience.

The Output:

Screenshot 2024-01-10 at 23 39 30

Suraj209211 commented 10 months ago

This is the correct link to the code for which I am getting the output: https://github.com/Suraj209211/Apple/blob/main/fix2.ipynb