Mismatch in predictions between original PyTorch model and converted CoreML model

🐞Describing the bug

An interesting situation. I have a PyTorch model for image classification, though the details do not matter. I need to run it on mobile devices, so I decided to convert it to the TFLite and CoreML formats. The predictions of the original PyTorch model and the converted TFLite model are essentially identical, but the mismatch between the original model and the converted CoreML model is on the order of 1e-2, which is a substantial error.

It would be hard for me to give you a complete reproducible example, but I will give you the main pieces, which should be enough.

In case you want details about the dataset, preprocessing, or anything like that: in my opinion they do not matter in this particular case, because they are identical for the TFLite and CoreML models, and the TFLite results match, so they must work.

To Reproduce

The model comes from timm. I use a fine-tuned version, but you can take the pretrained one; the architecture is identical.

model = timm.create_model('tf_efficientnetv2_s.in21k_ft_in1k', pretrained=True, num_classes=2, in_chans=3)

This is the conversion function:

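import os
from pathlib import Path

import coremltools as ct
import numpy as np
import torch


def export_model(model: torch.nn.Module, output_base_path: str, representation: str) -> Path:
    # The model must be in eval mode before tracing.
    assert not model.training

    os.makedirs(output_base_path, exist_ok=True)

    # Trace with a fixed-size dummy input; the converted model gets this static shape.
    sample_input = torch.randn((1, 3, 512, 512))

    traced_model = torch.jit.trace(model, (sample_input,))

    coreml_model = ct.convert(
        model=traced_model,
        convert_to=representation,
        inputs=[ct.TensorType(shape=sample_input.shape, dtype=np.float32)],
        outputs=[ct.TensorType(dtype=np.float32)],
        # compute_precision only applies to ML Programs; it stays None for "neuralnetwork".
        compute_precision=ct.precision.FLOAT32 if representation == "mlprogram" else None,
        compute_units=ct.ComputeUnit.CPU_ONLY,
    )

    # ML Programs are saved as .mlpackage, neural networks as .mlmodel.
    ext = "mlpackage" if representation == "mlprogram" else "mlmodel"
    export_path = Path(output_base_path, f"model.{ext}")

    coreml_model.save(export_path)

    return export_path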

And here is the validation code:

coreml_model = ct.models.MLModel(converted_model_path)

torch_model = load_model(checkpoint_path)

stats_coreml = []
stats_torch = []

dataset = load_dataset(dataset_path)
for image in dataset:
    coreml_output = coreml_model.predict({"x_1": image}).get("linear_0")
    torch_output = torch_model(image).detach().numpy()

    stats_coreml.append(torch.sigmoid(torch.from_numpy(coreml_output)).numpy())
    stats_torch.append(torch.sigmoid(torch.from_numpy(torch_output)).numpy())

    match = np.allclose(coreml_output, torch_output, atol=tolerance, rtol=0)
    assert match, (
         f"The difference between predictions of original and converted "
         f"models is greater than the allowed tolerance of {tolerance}"
    )

stats_coreml = np.array(stats_coreml)
stats_torch = np.array(stats_torch)

print('Max absolute difference:', np.abs((stats_coreml - stats_torch)).max())
print('Min absolute difference:', np.abs((stats_coreml - stats_torch)).min())
print('Mean absolute difference:', np.abs((stats_coreml - stats_torch)).mean())
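
One caveat about the loop above: the assert stops at the first mismatching image, so the summary statistics are only printed when every image passes the tolerance check. A minimal, dependency-free sketch of collecting both raw-output and post-sigmoid differences without aborting could look like this. The helper names (`sigmoid`, `abs_diff_stats`, `compare_outputs`) and the sample values are hypothetical, and plain Python lists stand in for the NumPy arrays used above.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def abs_diff_stats(reference, candidate):
    """Max, min and mean absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return {"max": max(diffs), "min": min(diffs), "mean": sum(diffs) / len(diffs)}


def compare_outputs(torch_logits, coreml_logits):
    """Report differences on raw outputs and on sigmoid probabilities."""
    return {
        "logits": abs_diff_stats(torch_logits, coreml_logits),
        "probabilities": abs_diff_stats(
            [sigmoid(v) for v in torch_logits],
            [sigmoid(v) for v in coreml_logits],
        ),
    }


# Made-up values for illustration only; real inputs would be the flattened
# outputs gathered over the whole dataset.
report = compare_outputs([0.30, -1.20, 2.00], [0.31, -1.25, 2.00])
for name, stats in report.items():
    print(name, stats)
```

Reporting the raw-output statistics next to the post-sigmoid ones makes it clear which of the two quantities the tolerance in the assert refers to.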

The TFLite case is analogous. Here are the statistics for TFLite:

Max absolute difference: 6.556511e-07
Min absolute difference: 0.0
Mean absolute difference: 3.983344e-08

And here are the statistics for CoreML:

Max absolute difference: 0.012094557
Min absolute difference: 8.384697e-06
Mean absolute difference: 0.0028901247
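
Note that these statistics are computed after the sigmoid. Because the slope of the sigmoid never exceeds 0.25, a probability difference d can only come from a raw-output difference of at least 4d, so the raw CoreML outputs must differ from the PyTorch ones by at least about 0.048 for some output. The arithmetic, as a small sketch (the function name `min_logit_gap` is hypothetical; the inputs are the maxima reported above):

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; fine for the small values used here."""
    return 1.0 / (1.0 + math.exp(-x))


def min_logit_gap(prob_gap: float) -> float:
    """Lower bound on the raw-output difference behind a probability difference.

    The derivative of the sigmoid is at most 0.25, hence
    |sigmoid(a) - sigmoid(b)| <= 0.25 * |a - b|.
    """
    return 4.0 * prob_gap


print("CoreML, lower bound on max raw-output gap:", min_logit_gap(0.012094557))
print("TFLite, lower bound on max raw-output gap:", min_logit_gap(6.556511e-07))

# Numerical spot check of the bound on one arbitrary pair of values.
a, b = 0.40, 0.35
assert abs(sigmoid(a) - sigmoid(b)) <= 0.25 * abs(a - b)
```

This is only a lower bound: for outputs far from zero the sigmoid is flatter, and the underlying raw-output gap can be considerably larger.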

Please help me. Thank you!

Just in case, here is the TFLite conversion (I use the ai_edge_torch package):

sample_input = torch.randn((1, 3, 512, 512))
tflite_model = ai_edge_torch.convert(model, (sample_input,))
tflite_model.export(export_path)

System environment:

coremltools version: 8.0 (the result is the same with 7.0).
OS: converted on macOS 13.6.9 (Ventura) and Ubuntu 20.04, tested on macOS 13.6.9 (Ventura); the result is the same in both cases.
PyTorch 2.2.1, TensorFlow 2.16.2, Python 3.10.15, NumPy 1.26.4.
