WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Librosa version requirements #59

Closed go2chayan closed 4 years ago

go2chayan commented 4 years ago

While trying to run the code, I used librosa 0.4.2 because that's the latest version compatible with the other specified dependencies (Python 2.7.15, Keras 2.2.4, Tensorflow 1.8.0), but it shows the following error:

Traceback (most recent call last):
  File "/home/iftekhart/VGG-Speaker-Recognition/src/utils.py", line 2, in <module>
    import librosa
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/__init__.py", line 15, in <module>
    from . import core
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/core/__init__.py", line 88, in <module>
    from .time_frequency import *  # pylint: disable=wildcard-import
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/core/time_frequency.py", line 9, in <module>
    from ..util.exceptions import ParameterError
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/util/__init__.py", line 73, in <module>
    from .utils import *  # pylint: disable=wildcard-import
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/util/utils.py", line 105, in <module>
    def valid_audio(y, mono=True):
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/librosa/cache.py", line 36, in __call__
    if self.cachedir is not None:
  File "/home/iftekhart/.pyenv/versions/VGG_env/lib/python2.7/site-packages/joblib/memory.py", line 918, in cachedir
    DeprecationWarning, stacklevel=2)
TypeError: expected string or buffer

So, I'm wondering if I'm installing the correct librosa version or if there is anything that I didn't get correctly. Would you please help?

WeidiXie commented 4 years ago

Hi, I think librosa is the problem; they seem to have updated the internal functions.

I would recommend switching to something else for loading data, e.g. scipy.

bml1g12 commented 3 years ago

It seems that Python 2.7.15, Keras 2.2.4, Tensorflow 1.8.0, scikit-learn==0.16.1, and librosa==0.3.1 work together; hopefully that is the same setup as the one used in the paper.

I think switching away from librosa to scipy can be risky, as the two differ in how they convert audio to numpy arrays, depending on the audio format (32-bit PCM etc.).
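To illustrate the format difference: for integer PCM files, scipy.io.wavfile.read returns integer arrays (int16, int32, or uint8), whereas librosa.load returns float32 samples in [-1, 1] by default. A minimal sketch of a conversion step, assuming integer PCM input; the helper name pcm_to_float is hypothetical and not part of this repository:

```python
import numpy as np

def pcm_to_float(samples):
    """Scale integer PCM samples to float32 in [-1, 1) (hypothetical helper)."""
    samples = np.asarray(samples)
    if samples.dtype.kind == 'f':
        # Already floating point (e.g. a 32-bit float WAV): only cast.
        return samples.astype(np.float32)
    if samples.dtype == np.uint8:
        # 8-bit WAV data is unsigned and centred on 128.
        return (samples.astype(np.float32) - 128.0) / 128.0
    # Signed integers: divide by the magnitude of the most negative value.
    full_scale = float(np.iinfo(samples.dtype).max) + 1.0
    return samples.astype(np.float32) / full_scale
```

Because load_data later subtracts the mean and divides by the standard deviation, an overall scale factor may matter less than it first appears, but that is worth checking against your own data rather than assuming.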

bml1g12 commented 3 years ago

Ah, but the librosa data loader is incredibly slow here for some reason, taking about 1 second per short .wav clip.

import numpy as np
import scipy.io.wavfile
from scipy import interpolate

def load_wav(vid_path, sr, mode='train'):
    # Replaces: wav, sr_ret = librosa.load(vid_path, sr=sr)
    # Note: for integer PCM files scipy returns integer samples, not floats in [-1, 1].
    sr_ret, old_audio = scipy.io.wavfile.read(vid_path)

    if sr_ret != sr:
        # Resample to the requested rate by linear interpolation.
        # float() avoids integer division under Python 2.
        duration = old_audio.shape[0] / float(sr_ret)

        time_old = np.linspace(0, duration, old_audio.shape[0])
        time_new = np.linspace(0, duration,
                               int(old_audio.shape[0] * sr / sr_ret))

        interpolator = interpolate.interp1d(time_old, old_audio.T)
        wav = interpolator(time_new).T
    else:
        # Sample rate already matches, so use the audio as read.
        wav = old_audio

    if mode == 'train':
        # Double the clip and reverse it with probability 0.3.
        extended_wav = np.append(wav, wav)
        if np.random.random() < 0.3:
            extended_wav = extended_wav[::-1]
        return extended_wav
    else:
        # Append the time-reversed clip to the original.
        extended_wav = np.append(wav, wav[::-1])
        return extended_wav

This seems to fix it: loading now takes about 0.003 seconds, and the output sample rate still equals "sr", as it would with librosa.

Unfortunately, this had trouble reading my particular WAV files, reporting that it could not read certain chunks, so I then tried the following, which seemed to work:

import soundfile as sf
def load_data(path, win_length=400, sr=16000, hop_length=160, n_fft=512, spec_len=250, mode='train'):
    #print("starting loading a datum")
    #t1 = timelib.time()
    wav = load_wav(path, sr=sr, mode=mode)
    linear_spect = lin_spectogram_from_wav(wav, hop_length, win_length, n_fft)
    mag, _ = librosa.magphase(linear_spect)  # magnitude
    mag_T = mag.T
    freq, time = mag_T.shape
    if mode == 'train':
        if time > spec_len:
            randtime = np.random.randint(0, time-spec_len)
            spec_mag = mag_T[:, randtime:randtime+spec_len]
        else:
            spec_mag = np.pad(mag_T, ((0, 0), (0, spec_len - time)), 'constant')
    else:
        spec_mag = mag_T
    # preprocessing: per time frame, subtract the mean and divide by the standard deviation
    mu = np.mean(spec_mag, 0, keepdims=True)
    std = np.std(spec_mag, 0, keepdims=True)
    #print("finished loading a datum", timelib.time() - t1)
    return (spec_mag - mu) / (std + 1e-5)

But this ran into NaN and division issues on the interpolation step, so I ended up just accepting variable sample rates with a simple wav, sr_ret = sf.read(vid_path) and hoping it doesn't break anything.

go2chayan commented 3 years ago

I found that resampling with an FFT method is too time-consuming. I finally used scipy.io.wavfile.read followed by resample_poly for faster processing. https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample_poly.html

bml1g12 commented 3 years ago

I found that resampling with an FFT method is too time-consuming. I finally used scipy.io.wavfile.read followed by resample_poly for faster processing. https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample_poly.html

Interesting! Could you please share the code snippet for using resample_poly for audio resampling?
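A minimal sketch of what such a snippet might look like, assuming the audio has already been read into a numpy array with time along axis 0. This is not the code go2chayan used; the helper name resample_to and the 16000 Hz default are assumptions for illustration:

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

def resample_to(wav, sr_in, sr_out=16000):
    """Resample wav from sr_in to sr_out with a polyphase filter (hypothetical helper)."""
    if sr_in == sr_out:
        # Nothing to do when the rates already match.
        return wav
    # Reduce the rate ratio to lowest terms, e.g. 44100 -> 16000 gives 160/441.
    ratio = Fraction(int(sr_out), int(sr_in))
    # Work in floating point; axis=0 is the time axis for mono arrays and
    # for (samples, channels) arrays as returned by scipy.io.wavfile.read.
    samples = np.asarray(wav, dtype=np.float64)
    return resample_poly(samples, ratio.numerator, ratio.denominator, axis=0)
```

In load_wav this could replace the interpolation branch, for example sr_ret, audio = scipy.io.wavfile.read(vid_path) followed by wav = resample_to(audio, sr_ret, sr). Whether the resulting features match those produced by librosa.load closely enough would still need checking.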