ar1st0crat / NWaves

.NET DSP library with a lot of audio processing functions
MIT License

How to get NWaves MFCC data similar to librosa #48

Closed OrangeOlko closed 3 years ago

OrangeOlko commented 3 years ago

Hello!

I'm trying to get the same array of data from NWaves as from librosa. I have read the wiki and tried a lot of settings, but the results are not close to what I expect.

Initial line using librosa was: mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_fft = int(2048/2), hop_length = int(np.floor(len(audio)/20)), n_mfcc = 13)

With just these settings I got arrays of different lengths: 273 items in librosa (from shape (13, 21)) versus 1685 * 13 in NWaves.

Audio length is 371499 in librosa (len(audio)), 134784 in NWaves. File is mono.

var waveFile = new WaveFile(stream);
var left = waveFile[Channels.Left];
// check left.Length

Then I tried to understand which default parameters librosa uses. My results stayed the same with the explicit list below. By the way, if fmax is set to any value, such as sr/2, the results change. In any case, I couldn't find all of these parameters in NWaves.


mfcc = librosa.feature.mfcc(y = audio, sr = 8000, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)),
                                    dct_type=2, norm='ortho', htk=False, fmin=0, center=True, n_mels=128, window='hanning')

Could you please advise how to get the data in the same format as librosa?

ar1st0crat commented 3 years ago

Hi!

Check this page

Also, this seems strange:

Audio length is 371499 in librosa (len(audio)), 134784 in NWaves. File is mono.

What is the sampling rate and duration (in seconds) of the signal?

OrangeOlko commented 3 years ago

Thanks! I started from this page but had no success.

print(librosa.get_duration(audio)) 16.848027210884354

sampling rate is 8000

ar1st0crat commented 3 years ago

So the number of samples should be, indeed, int(16.848027210884354 * 8000) = 134784.

mfcc = librosa.feature.mfcc(y = audio, sr = 8000, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)),
                                    dct_type=2, norm='ortho', htk=False, fmin=0, center=False, n_mels=128, window='hanning')

is equivalent to

int sr = 8000;           // sampling rate
int fftSize = 1024;
int filterbankSize = 128;

var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);

int hopLength = <just specify here the value stored in _ int(np.floor(len(audio)/20)) _ >

var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 13,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f, 
    DctType="2N",
    LifterSize = 0
};

var extractor = new MfccExtractor(opts);

Note. Set center=False in librosa (as I explained in wiki).

PS. Your hop_length depends on len(audio), so specify its concrete value to avoid confusion.
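A minimal sketch of pinning those values down once, so both sides use the same numbers. The helper name `hop_and_frame_durations` is hypothetical; the inputs (134784 samples, 8000 Hz, n_fft = 1024, 20 hops) are the figures quoted in this thread for the file at its native rate.

```python
import math

def hop_and_frame_durations(n_samples, sr, n_fft, n_hops):
    """Return (hop_length in samples, hop duration in s, frame duration in s)."""
    # Same expression as int(np.floor(len(audio) / n_hops)) on the librosa side.
    hop_length = int(math.floor(n_samples / n_hops))
    # NWaves options take durations in seconds, so divide by the sampling rate.
    hop_duration = hop_length / sr
    frame_duration = n_fft / sr
    return hop_length, hop_duration, frame_duration

hop_length, hop_duration, frame_duration = hop_and_frame_durations(
    n_samples=134784, sr=8000, n_fft=1024, n_hops=20
)
print(hop_length, hop_duration, frame_duration)
```

Note that the hop length changes with the number of samples, so it has to be recomputed if the signal is loaded at a different sampling rate.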

OrangeOlko commented 3 years ago

Thank you very much for the example!

I took this value from python debug code: print(int(np.floor(len(audio)/20)))

int hopLength = 18574;

Also I set center=False

After all this I got (13, 20) in Python librosa and (13, 8) in NWaves.

What else could be wrong?


var left = waveFile[Channels.Left];
int sr = 8000;           // sampling rate
int fftSize = 1024;
int filterbankSize = 128;

var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);

int hopLength = 18574;

var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 13,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f,
    DctType = "2N",
    LifterSize = 0
};

var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(left);

ar1st0crat commented 3 years ago

You need to find out why librosa returns 371499 samples. Because

the number of samples should be, indeed, int(16.848027210884354 * 8000) = 134784.

Also, do you understand what hop_length is (both in librosa and in NWaves)? Currently you're trying to extract 20 short frames from a relatively long signal, and the distance between two adjacent frames is quite big as well (it's a very unusual scenario).
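The shapes reported so far can be reproduced with simple frame-count arithmetic. This is a sketch, not either library's actual code: the uncentered rule `1 + (n - frame) // hop` and the centered padding of `n_fft // 2` samples per side follow librosa's framing, and applying the same uncentered rule to NWaves is an assumption that happens to match the (13, 8) result above. `frame_count` is a hypothetical helper.

```python
def frame_count(n_samples, frame_size, hop_size, center=False):
    """Number of full frames that fit into a signal of n_samples samples."""
    if center:
        # Centered framing pads frame_size // 2 samples on each side.
        n_samples += 2 * (frame_size // 2)
    if n_samples < frame_size:
        return 0
    return 1 + (n_samples - frame_size) // hop_size

# librosa on the 371499-sample signal, hop = 371499 // 20 = 18574
frames_centered = frame_count(371499, 1024, 18574, center=True)
frames_uncentered = frame_count(371499, 1024, 18574, center=False)
# NWaves on the 134784-sample signal with the same 18574-sample hop
frames_nwaves = frame_count(134784, 1024, 18574)
# Both sides once the signal and the hop agree (hop = 134784 // 20 = 6739)
frames_matched = frame_count(134784, 1024, 6739)
print(frames_centered, frames_uncentered, frames_nwaves, frames_matched)
```

The mismatch between 20 and 8 frames therefore comes from the two libraries seeing signals of different lengths while sharing one hop length.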

UPD. It seems the signal is resampled to 22050 Hz during loading.

According to librosa docs

Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.
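A quick sanity check of that explanation, using only numbers quoted in this thread (the duration printed by librosa, the 8000 Hz native rate, and the 22050 Hz default). The helper name `expected_samples` is hypothetical.

```python
def expected_samples(duration_s, sr):
    """Approximate sample count for a signal of the given duration and rate."""
    return round(duration_s * sr)

duration_s = 16.848027210884354   # printed by librosa.get_duration above

native_count = expected_samples(duration_s, 8000)      # what NWaves reads
resampled_count = expected_samples(duration_s, 22050)  # librosa default sr
ratio = resampled_count / native_count

print(native_count, resampled_count, round(ratio, 4))
```

The ratio of the two lengths is close to 22050 / 8000 = 2.75625, which is consistent with resampling on load.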

OrangeOlko commented 3 years ago

Thanks, I will try to find out

ar1st0crat commented 3 years ago

I've already found out (see my previous comment):

Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.

Simply set: librosa.load(..., sr=None)

OrangeOlko commented 3 years ago

Thanks again for the help! You are right about sr=None, so now I have arrays of the same size in librosa and in NWaves, but the data inside is different. I tried changing all the parameters, but none of them gave me a better result.

image

mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)), dct_type=2, norm='ortho', htk=False,fmin=0,center=False, n_mels=128, window='hanning')

image

ar1st0crat commented 3 years ago

You need to analyze the results more carefully. Compare them frame by frame. For example, here are the results of my experiments:

image

The values differ only very slightly, and this is because of round-off errors. As we can see, the algorithm is implemented correctly. In the first frame of your signal (and in many others as well) the first coeff seems very different, because the corresponding frame contains silence (sample values are very close to 0). Essentially, in this case you have some big value in mfcc_0 and zeros in the other coeffs (NWaves shows you values around 1e-5 to 1e-7, but basically they are zeros). Anyway, frames containing silence will most likely be discarded during feature analysis.
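A small sketch of why a silent frame looks like that. It assumes every mel band of a silent frame sits at the same floor value, here 10 * log10(1e-10) = -100 dB (taken from LogFloor = 1e-10 above), and that the cepstral step is an orthonormal DCT-II (dct_type=2, norm='ortho' on the librosa side). The pure-Python `dct2_ortho` is written for illustration and is not the code of either library.

```python
import math

def dct2_ortho(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

floor_db = 10.0 * math.log10(1e-10)      # -100 dB for every band
silent_frame = [floor_db] * 128          # 128 mel bands, all at the floor
coeffs = dct2_ortho(silent_frame)[:13]   # keep the first 13 coefficients

print(coeffs[0])                          # large negative value
print(max(abs(c) for c in coeffs[1:]))    # numerically zero
```

A constant input has energy only in the zeroth DCT basis function, so mfcc_0 carries the whole (large) value and the remaining coefficients are zero up to round-off.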

Also, read more about: 1) the first MFCC coeff; what to do with it; 2) filter banks and their settings (usually, 24 - 40 bands are enough; I don't know why librosa sets 128 by default); 3) window analysis and what is frame size / hop size

OrangeOlko commented 3 years ago

Thank you very much for the details, I will investigate this!

OrangeOlko commented 3 years ago

I wanted to post the final solution; I also found errors in my code, and describing them might help others.

  1. The audio file was 8-bit, so librosa and the Windows libraries read the samples differently. The file was converted to 16-bit.
  2. Librosa by default loads data as float64, though for this file float32 is needed to get the same results. So the call that loads the file was changed to use soundfile directly, which lets you set the dtype parameter:
    import soundfile as sf
    audio, sr = sf.read(filename, dtype='float32')
  3. As @ar1st0crat mentioned, the default sample rate in librosa is 22050, so sr=None can be passed when loading. But in my case we used soundfile, which doesn't resample audio, so this parameter isn't needed.
  4. As the first MFCC frame doesn't contain relevant information, it can be omitted. But even that frame now contains similar results.

Librosa code:

mfcc = librosa.feature.mfcc(y=audio, sr=sr, center=False, hop_length=int(np.floor(len(audio) / 20)),
                            n_mels=128, n_fft=1024, n_mfcc=20, fmax=4000, fmin=0, norm='ortho',
                            window='hann', htk=False, power=1, dct_type=2)

NWaves code:

int fftSize = 1024;
int filterbankSize = 128;
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);
var hopCount = 20;
var hopLength = chunkData.Count / hopCount;
var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = (double)hopLength / sr,
    FeatureCount = 20,
    FilterBank = melBank,
    NonLinearity = NonLinearityType.ToDecibel,
    Window = WindowTypes.Hann,
    LogFloor = 1e-10f,
    DctType = "2N",
    LifterSize = 0,
    FftSize = fftSize,
    HighFrequency = 4000,
    SpectrumType = SpectrumType.Magnitude
};

var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(chunkData.ToArray());
var mfccFlatten = new List<float>();

// skip the first MFCC coefficient (index 0) and flatten coefficient by coefficient
for (int m = 1; m < 20; m++)
    for (int frame = 0; frame < hopCount; frame++)
        mfccFlatten.Add(mfccVectors[frame][m]);
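For comparison, the flattening order used by the loop above can be sketched in Python. NWaves returns one vector per frame (frames x coefficients), while librosa returns an array of shape (coefficients, frames), so the loop walks coefficient by coefficient to match librosa's row order. `flatten_skipping_first` is a hypothetical helper and the toy input is made up for illustration.

```python
def flatten_skipping_first(frames, n_frames):
    """Flatten frames x coeffs data coefficient-major, dropping coefficient 0."""
    n_coeffs = len(frames[0])
    return [frames[t][m] for m in range(1, n_coeffs) for t in range(n_frames)]

# Two frames with three coefficients each: frames[t][m]
toy_frames = [
    [0.0, 1.0, 2.0],
    [10.0, 11.0, 12.0],
]
flat = flatten_skipping_first(toy_frames, n_frames=2)
print(flat)
```

On the librosa side, the same ordering would correspond to dropping the first row of the (coefficients, frames) array and flattening the rest row by row.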

Results using NWaves: image

Results using Librosa: image