ar1st0crat / NWaves

.NET DSP library with a lot of audio processing functions
MIT License

python_speech_features mfcc vs NWaves mfcc #53

Closed davidbelle closed 2 years ago

davidbelle commented 2 years ago

Hello.

I've built a Keras model in Python and want to use this model in C#. I need to be able to re-create the same MFCCs between the Python library (python_speech_features) and the NWaves version. I've tried lots of options but can't seem to get a match.

Below are the options used when training the model. I've also added a comment next to each line indicating what I think the matching field in NWaves should be.

mfccs = python_speech_features.base.mfcc(signal,
         samplerate=8000,    
         winlen=0.256,      # Frame Duration, fft/sr
         winstep=0.050,     # Hop Duration, hop_length / sr
         numcep=16,   # Feature Count?
         nfilt=26,          # FilterBankSize
         nfft=2048,         # FFftSize
         preemph=0.0,       # PreEmphasis
         ceplifter=0,       # LifterSize
         appendEnergy=False, # IncludeEnergy
         winfunc=np.hanning) # Window = NWaves.Windows.WindowType.Hann)

The python version returns array[16][16]

But my NWaves equivalent returns array[15][16]. Here are my options:

var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = (double)fftSize / sampleRate,
    HopDuration = 0.05,
    FftSize = 2048,
    Window = NWaves.Windows.WindowType.Hann,
    IncludeEnergy = false,
    FilterBankSize = 26,
    LifterSize = 0,
    PreEmphasis = 0
};

Any thoughts on where I might be going wrong?

Thanks

davidbelle commented 2 years ago

Also worth noting: I experimented with the example given on the "Non-expert DSP" page comparing librosa and NWaves, and I couldn't get those to match either. Below is my code.

sr = 8000
mfccs = librosa.feature.mfcc(signal, sr, n_mfcc=13,
    dct_type=2, norm='ortho', window='hamming',
    htk=False, n_mels=40, fmin=100, fmax=4000,
    n_fft=1024, hop_length=int(0.010 * sr), center=False)
int sr = 8000;            // sampling rate
int fftSize = 1024;
double lowFreq = 100;     // if not specified, will be 0
double highFreq = 4000;   // if not specified, will be samplingRate / 2
int filterbankSize = 40;  // or 24 for htk=true (usually)

// if 'htk=False' in librosa:
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr, lowFreq, highFreq);

// if the 'htk' parameter in librosa is set to True, replace the previous line with these lines:
// var melBands = FilterBanks.MelBands(filterbankSize, sr, lowFreq, highFreq);
// var melBankHtk = FilterBanks.Triangular(fftSize, sr, melBands, null, Scale.HerzToMel);

var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = 0.010,
    FeatureCount = 13,
    FilterBank = melBank,                       // or melBankHtk if htk=True
    NonLinearity = NonLinearityType.ToDecibel,  // mandatory
    Window = NWaves.Windows.WindowType.Hamming, // in librosa 'hann' is the default
    LogFloor = 1e-10f,                          // mandatory
    DctType = "2N",
    LifterSize = 0
};

var extractor = new MfccExtractor(opts);
var mfccs = extractor.ParallelComputeFrom(signal);

Seems very strange. I have looked at the values of the signal in Python and in .NET and they match; they are between -1 and 1.

Would love any insight anyone might have.

ar1st0crat commented 2 years ago

Hi! I'll take a look at python_speech_features a bit later. Meanwhile, you can read this thread regarding librosa nuances.

ar1st0crat commented 2 years ago

It took me more time than I expected, but anyway... Essentially, python_speech_features.base.mfcc is very simple and straightforward, but there are some nuances. Here's an example of NWaves settings:

var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = 2048.0 / sampleRate,
    HopDuration = 0.05,
    FilterBankSize = 26,
    SpectrumType = SpectrumType.PowerNormalized,
    NonLinearity = NonLinearityType.LogE,
    DctType = "2N",
    Window = WindowType.Hann,
    FftSize = 2048,
    IncludeEnergy = false
};

If you run MfccExtractor with these options, you'll get results that differ slightly from python_speech_features (although they're pretty close). Don't forget to normalize samples in the Python version: signal = signal / 32768 (or set normalize: false in the WaveFile constructor/loader in NWaves).
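That normalization step can be illustrated with a small sketch (the sample values here are just an example, assuming the WAV file holds 16-bit PCM):

```python
import numpy as np

# 16-bit PCM samples arrive as int16 in the range [-32768, 32767];
# dividing by 32768 maps them into [-1, 1), matching NWaves' default scaling
raw = np.array([-32768, 0, 16384], dtype=np.int16)
signal = raw / 32768.0
print(signal)  # [-1.   0.   0.5]
```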

Let's compare the 31st vectors, for example:

(screenshot: the 31st MFCC vectors from python_speech_features and NWaves side by side)

Personally, I'd be OK with these coeffs. But if you need to get as close as possible to python_speech_features, you'll have to add some code.

There are two reasons for the discrepancies:

  1. Mel filterbank evaluation.
  2. Normalization of the power spectra.

python_speech_features constructs its mel filterbank differently than NWaves, librosa, Kaldi, etc. I wrote a function that produces identical weights, and you can use it:

float[][] PsfFilterbank(int samplingRate, int filterbankSize, int fftSize, double lowFreq = 0, double highFreq = 0)
{
    var filterbank = new float[filterbankSize][];

    if (highFreq <= lowFreq)
    {
        highFreq = samplingRate / 2.0;
    }

    var low = NWaves.Utils.Scale.HerzToMel(lowFreq);
    var high = NWaves.Utils.Scale.HerzToMel(highFreq);

    // python_speech_features maps mel points to FFT bins via floor((nfft + 1) * f / sr)
    var res = (fftSize + 1) / (float)samplingRate;

    var bins = Enumerable
                  .Range(0, filterbankSize + 2)
                  .Select(i => (float)Math.Floor(res * NWaves.Utils.Scale.MelToHerz(low + i * (high - low) / (filterbankSize + 1))))
                  .ToArray();

    for (var i = 0; i < filterbankSize; i++)
    {
        filterbank[i] = new float[fftSize / 2 + 1];

        for (var j = (int)bins[i]; j < (int)bins[i + 1]; j++)
        {
            filterbank[i][j] = (j - bins[i]) / (bins[i + 1] - bins[i]);
        }
        for (var j = (int)bins[i + 1]; j < (int)bins[i + 2]; j++)
        {
            filterbank[i][j] = (bins[i + 2] - j) / (bins[i + 2] - bins[i + 1]);
        }
    }

    return filterbank;
}

Now, use it in MFCC options:

var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = 2048.0 / sampleRate,
    HopDuration = 0.05,
    FilterBank = PsfFilterbank(sampleRate, 26, 2048),
    SpectrumType = SpectrumType.PowerNormalized,
    NonLinearity = NonLinearityType.LogE,
    DctType = "2N",
    Window = WindowType.Hann,
    FftSize = 2048,
    IncludeEnergy = false
};

With these settings everything's good, except the first MFCC coefficient:

(screenshot: MFCC vectors match except for the first coefficient)

This is because NWaves and python_speech_features normalize the power spectrum differently. Here's code compensating for this difference:

// call this on the already computed mfccVectors
// (filterbankSize is 26 in this example):

for (var i = 0; i < mfccVectors.Count; i++)
{
    mfccVectors[i][0] -= (float)(Math.Log(2) * Math.Sqrt(filterbankSize));
}

Now, the first coeff is -48.8676...

If you set appendEnergy=true, the compensation is simpler (although less precise):

// call this on the already computed mfccVectors:

for (var i = 0; i < mfccVectors.Count; i++)
{
    mfccVectors[i][0] -= (float)Math.Log(2);
}

PS. Also note that python_speech_features auto-pads the last (incomplete) frame of the signal with zeros. This is why you get 16 MFCC vectors instead of 15 (as in NWaves). You can simply discard this last vector, or zero-pad the signal in NWaves manually to match the behaviour. Personally, I prefer the first solution.
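The frame-count difference can be sketched with a small calculation (an illustration, assuming a 1-second signal at 8 kHz with the original settings: winlen = 0.256 s and winstep = 0.05 s, i.e. 2048-sample frames with a 400-sample hop):

```python
import math

def num_frames(n_samples, frame_len, hop_len, pad_last=True):
    # python_speech_features zero-pads the last incomplete frame (ceil),
    # while NWaves simply drops it (floor)
    if n_samples <= frame_len:
        return 1
    steps = (n_samples - frame_len) / hop_len
    return 1 + (math.ceil(steps) if pad_last else math.floor(steps))

print(num_frames(8000, 2048, 400, pad_last=True))   # 16, like python_speech_features
print(num_frames(8000, 2048, 400, pad_last=False))  # 15, like NWaves
```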

davidbelle commented 2 years ago

Amazing work. Thank you. 🙌🙌🙌 I will try it this week and report back. 👍

ar1st0crat commented 2 years ago

MFCC and Mel spectrograms in librosa, kaldi, python_speech_features

YouTube video

Playground (Feature Extractors, MFCC)