Open Yaxit opened 2 years ago
@Yaxit It was tested with the MFCC from TensorFlow using Hamming window
If Librosa is a reference then it may be worth looking at it for a future improvement.
The dct_type
and norm
look coherent with the TensorFlow version : https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms
From TensorFlow documentation:
Mel-Frequency Cepstral Coefficient (MFCC) calculation consists of taking the DCT-II of a log-magnitude mel-scale spectrogram. HTK's MFCCs use a particular scaling of the DCT-II which is almost orthogonal normalization. We follow this convention.
I think power
should be 1.
@Yaxit I have made some tests. Here is a Python script to compute the 3 versions : TensorFlow, CMSIS-DSP, Librosa. I am using the CMSIS-DSP Python wrapper for testing (pip install cmsisdsp
).
The results I am getting:
TF
tf.Tensor(
[-60.15401347 -0.33196457 -6.04726881 -7.64207099 -3.54823378
1.31988137 3.88698841 2.17443039 -1.33955457 -3.54223212
-2.35875732 0.7373274 2.93307946], shape=(13,), dtype=float64)
CMSIS-DSP
[-60.154556 -0.33157736 -6.0473366 -7.6420746 -3.5485296
1.3204467 3.8861852 2.1752558 -1.3400406 -3.5417192
-2.3583646 0.7381086 2.9335299 ]
Librosa
[-352.07525027 24.55651523 -26.26419471 -30.30838539 -15.41778675
6.76462123 16.88078724 9.97585971 -5.81466217 -15.06582109
-10.2487501 3.41145905 12.73910052]
I don't understand all the settings of Librosa. I played with some of them and I can't get the same result as TensorFlow.
I was hoping that the Librosa setting htk=True
may help but it did not.
So, it looks like Librosa is using different convention and formula.
For below script, I have started from the TensorFlow example in the documentation and I have reused the same parameters:
import tensorflow as tf
import cmsisdsp as dsp
import librosa
import numpy as np
import scipy.signal as sig
import cmsisdsp.mfcc as mfcc
from cmsisdsp.datatype import F32
import math
# https://librosa.org/doc/main/generated/librosa.feature.mfcc.html
# https://www.tensorflow.org/api_docs/python/tf/signal/mfccs_from_log_mel_spectrograms
FFTSize = 1024
numOfDctOutputs = 13
freq_min = 80.0
freq_high = 7600.0
numOfMelFilters = 80
frame_step = 256
num_samples, sample_rate = 32000, 16000.0
t = np.linspace(0,2,num_samples)
pcm = np.array(np.sin(2*math.pi * t * 1000),dtype=np.double)
# A 1024-point STFT with frames of 64 ms and 75% overlap.
stfts = tf.signal.stft(pcm, frame_length=FFTSize, frame_step=frame_step,
fft_length=FFTSize,
window_fn=tf.signal.hamming_window)
spectrograms = tf.abs(stfts)
# Warp the linear scale spectrograms into the mel-scale.
num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = freq_min, freq_high, numOfMelFilters
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
upper_edge_hertz,dtype=tf.double)
#print(spectrograms.dtype)
#print(linear_to_mel_weight_matrix.dtype)
mel_spectrograms = tf.tensordot(
spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
linear_to_mel_weight_matrix.shape[-1:]))
# Compute a stabilized log to get log-magnitude mel-scale spectrograms.
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
# Compute MFCCs from log_mel_spectrograms and take the first 13.
mfccsT = tf.signal.mfccs_from_log_mel_spectrograms(
log_mel_spectrograms)[..., :numOfDctOutputs]
mfccsL = librosa.feature.mfcc(y=np.array(pcm),
sr=sample_rate,
n_mfcc=numOfDctOutputs,
n_fft = FFTSize,
hop_length = frame_step,
window=sig.hamming(FFTSize, sym=False),
center=False, # for padding, not used
pad_mode='constant', # for padding, not used
power=1.0,
n_mels=numOfMelFilters,
fmin=freq_min,
fmax=freq_high,
dct_type=2,
norm='ortho',
htk=True
)
print("TF")
print(mfccsT[0])
window = sig.hamming(FFTSize, sym=False)
filtLen,filtPos,packedFilters = mfcc.melFilterMatrix(F32,freq_min, freq_high, numOfMelFilters,sample_rate,FFTSize)
dctMatrixFilters = mfcc.dctMatrix(F32,numOfDctOutputs, numOfMelFilters)
mfccf32=dsp.arm_mfcc_instance_f32()
status=dsp.arm_mfcc_init_f32(mfccf32,FFTSize,numOfMelFilters,numOfDctOutputs,dctMatrixFilters,
filtPos,filtLen,packedFilters,window)
tmp=np.zeros(FFTSize + 2)
res=dsp.arm_mfcc_f32(mfccf32,pcm[0:FFTSize],tmp)
print("CMSIS-DSP")
print(res)
print("Librosa")
print(np.array(mfccsL).T[0])
@christophe0606 Thank you for the clarification. I also got the same results with librosa having different results. I'm trying to discover exactly what is different, and why. It would be nice to understand where the difference lies. I'll keep you posted.
Hello,
I am trying to replicate the MFCC output of Librosa, which is widely used as the reference library for audio manipulation.
On my ARM microcontroller, I am using the
arm_mfcc_f32
callback and thearm_mfcc_init_f32
to initialize the parameters. The parameters have been generated with the scripts from this repository.Generally, I would expect some small deviation from the results on the two platforms, since CMSIS uses approximated calculations for logarithms, etc; however, I found that the outputs are completely different (one order of magnitude, sign).
I also could not find a clear explanation of which parameters are used in the CMSIS implementation for
power
,dct_type
, andnorm
. Am I missing some steps? Or is this expected altogether?I believe this would be quite important to clarify, to allow people to design a data pipeline that can be quite reproducible in the microcontroller :)