MFCC differences between torchaudio and CoreML

🐞 Describing the bug

I want to train a model that takes MFCCs as its input. Luckily, coremltools can convert torchaudio's MFCC transform :) but the converted model's output differs numerically from torchaudio's. I assume the differences arise because Core ML and torchaudio use different parameters (such as the FFT size or the number of mel frequency bins). The discrepancy seems to be driven mainly by the higher frequencies: reducing torchaudio's n_mels decreases it slightly.

What are the recommended parameters to minimize the discrepancy between the Core ML and torchaudio MFCC outputs?
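
One other factor that may be worth ruling out, offered as an assumption rather than a diagnosis: coremltools uses float16 compute precision by default when converting to mlprogram, and float16 has coarse resolution at the magnitudes that dB-scaled MFCC values can reach. The NumPy-only sketch below shows the gap between adjacent float16 values at a few magnitudes and the mean error of a float32 to float16 round trip. `float16_roundtrip_error` is a hypothetical helper, and the input is synthetic data, not real MFCC output.

```python
import numpy as np


def float16_roundtrip_error(values):
    """Mean absolute error introduced by storing float32 values as float16."""
    values = np.asarray(values, dtype=np.float32)
    roundtrip = values.astype(np.float16).astype(np.float32)
    return float(np.abs(roundtrip - values).mean())


# Gap between adjacent representable float16 values at a few magnitudes.
for magnitude in (1.0, 10.0, 100.0, 500.0):
    gap = float(np.spacing(np.float16(magnitude)))
    print(f"float16 spacing near {magnitude}: {gap}")

# Synthetic stand-in for feature values (an assumption, not measured data).
rng = np.random.default_rng(0)
synthetic = rng.uniform(-200.0, 200.0, size=10_000)
print("mean round-trip error:", float16_roundtrip_error(synthetic))
```

If float16 turns out to matter, `coremltools.convert` accepts `compute_precision=coremltools.precision.FLOAT32` for mlprogram targets. Whether that closes the gap in this case has not been verified here.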

To Reproduce

import torch
import torchaudio
import coremltools
import numpy as np

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC()

    def forward(self, wav):
        return self.mfcc(wav)

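# Load the test clip; MFCC() above keeps torchaudio's default parameters.
x, fs = torchaudio.load("test.wav", normalize=True)

model = Model()
model.eval()
model = torch.jit.trace(model, x)

# Reference output from the traced PyTorch model.
y = model(x).numpy()

# Convert the traced model to a Core ML ML Program with a fixed input shape.
core_model = coremltools.convert(
    model, convert_to="mlprogram", inputs=[coremltools.TensorType(shape=x.shape)]
)

core_model.save("newmodel.mlpackage")
core_y = core_model.predict({"wav": x.numpy()})

# Mean absolute difference between the Core ML and torchaudio outputs.
difference = np.abs(next(iter(core_y.values())) - y).mean()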
print(difference)  # 0.04909986
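
The single mean above does not show where the error concentrates. To check whether particular coefficients dominate, the error could be broken down per MFCC coefficient. This is a minimal sketch that assumes both outputs have shape (channel, n_mfcc, time). `error_per_coefficient` is a hypothetical helper, and the arrays below are synthetic stand-ins for the two outputs.

```python
import numpy as np


def error_per_coefficient(reference, candidate):
    """Mean absolute error for each MFCC coefficient.

    Assumes both arrays have shape (channel, n_mfcc, time).
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(
            f"shape mismatch: {reference.shape} vs {candidate.shape}"
        )
    # Average over the channel and time axes, keeping the coefficient axis.
    return np.abs(reference - candidate).mean(axis=(0, 2))


# Synthetic stand-ins: the noise grows with the coefficient index.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1, 40, 100))
noise_scale = np.linspace(0.0, 0.1, 40).reshape(1, 40, 1)
candidate = reference + noise_scale * rng.normal(size=reference.shape)

per_coeff = error_per_coefficient(reference, candidate)
print(per_coeff.shape)  # (40,)
print("coefficient with the largest error:", int(per_coeff.argmax()))
```

With the variables from the script above, the call would be `error_per_coefficient(y, next(iter(core_y.values())))`.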

System environment (please complete the following information):

coremltools version: 7.2
OS (e.g. MacOS version or Linux type): macOS 14.4.1 (M2)
Any other relevant version information: torch 2.2.0
