Closed OrangeOlko closed 3 years ago
Hi!
Also, this seems strange:
Audio length is 371499 in librosa (len(audio)), 134784 in NWaves. File is mono.
What is the sampling rate and duration (in seconds) of the signal?
Thanks! I started from this page but had no success.
print(librosa.get_duration(audio))
16.848027210884354
sampling rate is 8000
So the number of samples should indeed be int(16.848027210884354 * 8000) = 134784.
mfcc = librosa.feature.mfcc(y = audio, sr = 8000, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)),
dct_type=2, norm='ortho', htk=False, fmin=0, center=False, n_mels=128, window='hanning')
is equivalent to
int sr = 8000; // sampling rate
int fftSize = 1024;
int filterbankSize = 128;
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);
int hopLength = <specify here the concrete value of int(np.floor(len(audio)/20))>;
var opts = new MfccOptions
{
SamplingRate = sr,
FrameDuration = (double)fftSize / sr,
HopDuration = (double)hopLength / sr,
FeatureCount = 13,
FilterBank = melBank,
NonLinearity = NonLinearityType.ToDecibel,
Window = WindowTypes.Hann,
LogFloor = 1e-10f,
DctType = "2N",
LifterSize = 0
};
var extractor = new MfccExtractor(opts);
Note: set center=False in librosa (as I explained in the wiki).
PS. Your hop_length depends on len(audio), so specify its concrete value to avoid confusion.
Thank you very much for the example!
I took this value from python debug code:
print(int(np.floor(len(audio)/20)))
int hopLength = 18574;
I also set center=False.
After all this I got shape (13, 20) in librosa and (13, 8) in NWaves.
What else could be wrong?
var left = waveFile[Channels.Left];
int sr = 8000; // sampling rate
int fftSize = 1024;
int filterbankSize = 128;
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);
int hopLength = 18574;
var opts = new MfccOptions
{
SamplingRate = sr,
FrameDuration = (double)fftSize / sr,
HopDuration = (double)hopLength / sr,
FeatureCount = 13,
FilterBank = melBank,
NonLinearity = NonLinearityType.ToDecibel,
Window = WindowTypes.Hann,
LogFloor = 1e-10f,
DctType = "2N",
LifterSize = 0
};
var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(left);
You need to find out why librosa returns 371499 samples, because the number of samples should indeed be int(16.848027210884354 * 8000) = 134784.
Also, do you understand what hop_length is (both in librosa and in NWaves)? Currently you're trying to extract 20 short frames from a relatively long signal, and the distance between two adjacent frames is quite large as well (a very unusual scenario).
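To make the frame-count mismatch concrete, here is a minimal sketch of the usual framing arithmetic. It assumes both libraries count frames as 1 + (n_samples - frame_size) // hop_length without centering, and that centered framing pads frame_size // 2 samples on each side; the helper name frame_count is hypothetical and is not part of librosa or NWaves.

```python
def frame_count(n_samples: int, frame_size: int, hop_length: int, center: bool = False) -> int:
    """Number of analysis frames for a signal of n_samples samples (assumed formula)."""
    if center:
        # Assumption: centered framing pads frame_size // 2 samples on both sides.
        n_samples += 2 * (frame_size // 2)
    if n_samples < frame_size:
        return 0
    return 1 + (n_samples - frame_size) // hop_length


hop = 18574  # hop length reported in this thread

print(frame_count(371499, 1024, hop))               # 371499-sample signal, center=False
print(frame_count(134784, 1024, hop))               # 134784-sample signal, same hop
print(frame_count(371499, 1024, hop, center=True))  # 371499-sample signal, centered framing
```

Under these assumptions the three calls give 20, 8 and 21 frames, which lines up with the (13, 20), (13, 8) and (13, 21) shapes reported in the thread: the same hop length applied to signals of different lengths yields different frame counts.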
UPD. It seems the signal is resampled to 22050 Hz during loading.
According to the librosa docs:
Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.
Thanks, I will try to find out
I've already found out (see my previous comment):
Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.
Simply set: librosa.load(..., sr=None)
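The two sample counts can be checked with plain arithmetic. This is only a sketch using the numbers quoted in this thread; the variable names are illustrative.

```python
native_sr = 8000                # sampling rate of the file, as reported above
default_sr = 22050              # librosa.load default when sr is not specified
duration = 16.848027210884354   # seconds, as printed by librosa.get_duration

# Sample count at the file's native rate (what NWaves sees).
native_samples = int(duration * native_sr)

# Sample count after resampling to librosa's default rate.
resampled_samples = round(duration * default_sr)

print(native_samples, resampled_samples)
```

The first value matches the 134784 samples seen in NWaves and the second matches the 371499 samples seen in librosa, which is consistent with the file having been resampled on load.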
Thanks again for the help! You are right about sr=None, so now I have arrays of the same size in librosa and in NWaves, but the data inside is different. I tried changing all the parameters, but none of them gave me a better result.
mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_mfcc=13, n_fft=1024, hop_length=int(np.floor(len(audio)/20)), dct_type=2, norm='ortho', htk=False,fmin=0,center=False, n_mels=128, window='hanning')
You need to analyze the results more carefully. Compare them frame by frame. For example, here are the results of my experiments:
The values differ very slightly, and this is because of round-off errors. As we can see, the algorithm is implemented correctly. In the first frame of your signal (and in many others as well) the first coeff seems very different, because the corresponding frame contains silence (sample values are very close to 0). Essentially, in this case you have some big value in mfcc_0 and zeros in the other coeffs (NWaves shows you 10e-5 ... 10e-7, but basically they are zeros). In any case, frames containing silence will most likely be discarded during feature analysis.
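The silent-frame behaviour can be reproduced in isolation. The sketch below assumes that a silent frame yields log-mel energies that are all equal to the floor value (10 * log10(1e-10) = -100 dB here); the exact floor and dB conventions differ between libraries, so the numbers are illustrative only. The orthonormal DCT-II is written out by hand to keep the example self-contained.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II of a sequence, computed directly from the definition."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


n_mels = 128
floor_db = 10 * math.log10(1e-10)    # assumed log floor, -100 dB
silent_frame = [floor_db] * n_mels   # flat log-mel spectrum of a silent frame

coeffs = dct2_ortho(silent_frame)[:13]
print(coeffs[0])                        # large negative value
print(max(abs(c) for c in coeffs[1:]))  # round-off noise only
```

A flat input puts all of its energy into coefficient 0, so the remaining coefficients are zero up to round-off. Tiny non-zero values in those positions are therefore expected and can differ between implementations.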
Also, read more about: 1) the first MFCC coeff; what to do with it; 2) filter banks and their settings (usually, 24 - 40 bands are enough; I don't know why librosa sets 128 by default); 3) window analysis and what is frame size / hop size
Thank you very much for the details, I will investigate this!
I wanted to post the final solution. I found the errors in my code, and this might help others.
import soundfile as sf
audio, sr = sf.read(filename, dtype='float32')
Librosa code:
mfcc = librosa.feature.mfcc(y = audio, sr = sr, center=False, hop_length=int(np.floor(len(audio)/ 20)),
n_mels=128, n_fft = 1024, n_mfcc = 20, fmax = 4000, fmin = 0,norm = 'ortho',
window = 'hann', htk = False, power=1, dct_type=2)
NWaves code:
int fftSize = 1024;
int filterbankSize = 128;
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr);
var hopCount = 20;
var hopLength = chunkData.Count / hopCount;
var opts = new MfccOptions
{
SamplingRate = sr,
FrameDuration = (double)fftSize / sr,
HopDuration = (double)hopLength / sr,
FeatureCount = 20,
FilterBank = melBank,
NonLinearity = NonLinearityType.ToDecibel,
Window = WindowTypes.Hann,
LogFloor = 1e-10f,
DctType = "2N",
LifterSize = 0,
FftSize = fftSize,
HighFrequency = 4000,
SpectrumType = SpectrumType.Magnitude
};
var mfccExtractor = new MfccExtractor(opts);
var mfccVectors = mfccExtractor.ComputeFrom(chunkData.ToArray());
var mfccFlatten = new List<float>();
// Skip coefficient 0 and flatten coefficient by coefficient (all frames of coeff 1, then coeff 2, ...)
for (int m = 1; m < 20; m++)
    for (int frame = 0; frame < hopCount; frame++)
        mfccFlatten.Add(mfccVectors[frame][m]);
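For comparison, the same flattening order can be sketched in Python on toy data. This assumes the librosa result, which has shape (n_mfcc, n_frames), is flattened row by row after dropping coefficient 0, for example with mfcc[1:].flatten(). The thread does not show that step, so treat it as an assumption; flatten_skip_first is a hypothetical helper.

```python
def flatten_skip_first(frames, n_mfcc):
    """Flatten frame-major MFCC vectors coefficient by coefficient, skipping coefficient 0.

    frames: list of per-frame coefficient lists, i.e. frames[frame][coefficient].
    """
    return [frame[m] for m in range(1, n_mfcc) for frame in frames]


# Toy data: 3 frames with 4 coefficients each (values encode frame and coefficient).
frames = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]

flat = flatten_skip_first(frames, 4)
print(flat)  # [11, 21, 31, 12, 22, 32, 13, 23, 33]
```

The order matters when comparing the two outputs element by element: the values for one coefficient across all frames come first, then the next coefficient.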
Results using NWaves:
Results using Librosa:
Hello!
I'm trying to get the same array of data in NWaves as in librosa. I read the wiki and tried a lot of the settings from it, but the results are not close to what I need.
Initial line using librosa was:
mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_fft = int(2048/2), hop_length = int(np.floor(len(audio)/20)), n_mfcc = 13)
With just these settings I got arrays of different lengths: 273 items in librosa (from shape (13, 21)) versus 1685 * 13 in NWaves.
Audio length is 371499 in librosa (len(audio)), 134784 in NWaves. File is mono.
Then I tried to understand which default parameters librosa uses. My results stayed the same with this list. By the way, if fmax is set to any value, such as sr/2, the results change. In any case, I couldn't find all of these parameters in NWaves.
Could you please advise how to get the data in the same format as librosa?