Matching MFCC configuration with Librosa's

Yaxit commented 1 year ago

Hello,

I am trying to replicate the MFCC output of Librosa, which is widely used as the reference library for audio manipulation.

   mfccs = librosa.feature.mfcc(y=data,
        sr=16000,
        n_mfcc=32,
        n_fft = 2048,
        hop_length = 1024,
        window='hamming',
        center=False,            # for padding, not used
        pad_mode='constant',    # for padding, not used
        power=2.0,
        n_mels=32,
        fmin=64,
        fmax=8000,
        dct_type=2,
        norm='ortho'
        )

On my ARM microcontroller, I am using the arm_mfcc_f32 callback and the arm_mfcc_init_f32 to initialize the parameters. The parameters have been generated with the scripts from this repository.

Generally, I would expect some small deviation from the results on the two platforms, since CMSIS uses approximated calculations for logarithms, etc; however, I found that the outputs are completely different (one order of magnitude, sign).

I also could not find a clear explanation of which parameters are used in the CMSIS implementation for power, dct_type, and norm. Am I missing some steps? Or is this expected altogether?

I believe this would be quite important to clarify, to allow people to design a data pipeline that can be quite reproducible in the microcontroller :)

christophe0606 commented 1 year ago

@Yaxit It was tested with the MFCC from TensorFlow using Hamming window

If Librosa is a reference then it may be worth looking at it for a future improvement.

The dct_type and norm look coherent with the TensorFlow version : https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms

From TensorFlow documentation:

Mel-Frequency Cepstral Coefficient (MFCC) calculation consists of taking the DCT-II of a log-magnitude mel-scale spectrogram. HTK's MFCCs use a particular scaling of the DCT-II which is almost orthogonal normalization. We follow this convention.

I think power should be 1.

christophe0606 commented 1 year ago

@Yaxit I have made some tests. Here is a Python script to compute the 3 versions : TensorFlow, CMSIS-DSP, Librosa. I am using the CMSIS-DSP Python wrapper for testing (pip install cmsisdsp).

The results I am getting:

TF
tf.Tensor(
[-60.15401347  -0.33196457  -6.04726881  -7.64207099  -3.54823378
   1.31988137   3.88698841   2.17443039  -1.33955457  -3.54223212
  -2.35875732   0.7373274    2.93307946], shape=(13,), dtype=float64)
CMSIS-DSP
[-60.154556    -0.33157736  -6.0473366   -7.6420746   -3.5485296
   1.3204467    3.8861852    2.1752558   -1.3400406   -3.5417192
  -2.3583646    0.7381086    2.9335299 ]
Librosa
[-352.07525027   24.55651523  -26.26419471  -30.30838539  -15.41778675
    6.76462123   16.88078724    9.97585971   -5.81466217  -15.06582109
  -10.2487501     3.41145905   12.73910052]

I don't understand all the settings of Librosa. I played with some of them and I can't get the same result as TensorFlow.

I was hoping that the Librosa setting htk=True may help but it did not.

So, it looks like Librosa is using different convention and formula.

For below script, I have started from the TensorFlow example in the documentation and I have reused the same parameters:

import tensorflow as tf 
import cmsisdsp as dsp
import librosa
import numpy as np
import scipy.signal as sig
import cmsisdsp.mfcc as mfcc
from cmsisdsp.datatype import F32
import math 

# https://librosa.org/doc/main/generated/librosa.feature.mfcc.html
# https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms

FFTSize = 1024
numOfDctOutputs = 13

freq_min = 80.0
freq_high = 7600.0
numOfMelFilters = 80
frame_step = 256

num_samples, sample_rate = 32000, 16000.0

t = np.linspace(0,2,num_samples)
pcm = np.array(np.sin(2*math.pi * t * 1000),dtype=np.double)

# A 1024-point STFT with frames of 64 ms and 75% overlap.
stfts = tf.signal.stft(pcm, frame_length=FFTSize, frame_step=frame_step,
                       fft_length=FFTSize,
                       window_fn=tf.signal.hamming_window)
spectrograms = tf.abs(stfts)

# Warp the linear scale spectrograms into the mel-scale.
num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = freq_min, freq_high, numOfMelFilters
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
  num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
  upper_edge_hertz,dtype=tf.double)

#print(spectrograms.dtype)
#print(linear_to_mel_weight_matrix.dtype)
mel_spectrograms = tf.tensordot(
  spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
  linear_to_mel_weight_matrix.shape[-1:]))

# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)

# Compute MFCCs from log_mel_spectrograms and take the first 13.
mfccsT = tf.signal.mfccs_from_log_mel_spectrograms(
  log_mel_spectrograms)[..., :numOfDctOutputs]

mfccsL = librosa.feature.mfcc(y=np.array(pcm),
        sr=sample_rate,
        n_mfcc=numOfDctOutputs,
        n_fft = FFTSize,
        hop_length = frame_step,
        window=sig.hamming(FFTSize, sym=False),
        center=False,            # for padding, not used
        pad_mode='constant',    # for padding, not used
        power=1.0,
        n_mels=numOfMelFilters,
        fmin=freq_min,
        fmax=freq_high,
        dct_type=2,
        norm='ortho',
        htk=True
        )

print("TF")
print(mfccsT[0])

window = sig.hamming(FFTSize, sym=False)

filtLen,filtPos,packedFilters = mfcc.melFilterMatrix(F32,freq_min, freq_high, numOfMelFilters,sample_rate,FFTSize)

dctMatrixFilters = mfcc.dctMatrix(F32,numOfDctOutputs, numOfMelFilters)

mfccf32=dsp.arm_mfcc_instance_f32()

status=dsp.arm_mfcc_init_f32(mfccf32,FFTSize,numOfMelFilters,numOfDctOutputs,dctMatrixFilters,
    filtPos,filtLen,packedFilters,window)

tmp=np.zeros(FFTSize + 2)
res=dsp.arm_mfcc_f32(mfccf32,pcm[0:FFTSize],tmp)

print("CMSIS-DSP")
print(res)

print("Librosa")
print(np.array(mfccsL).T[0])

Yaxit commented 1 year ago

@christophe0606 Thank you for the clarification. I also got the same results with librosa having different results. I'm trying to discover exactly what is different, and why. It would be nice to understand where the difference lies. I'll keep you posted.

ARM-software / CMSIS-DSP

Matching MFCC configuration with Librosa's #54