Error while parsing silent parts of audio

kwanUm commented 4 years ago

Hi, I'm using the model to extract f0 features in real time for an audio recorded by the microphone.

The process to obtain f0 looks as follows for every 40ms of audio:

def get_praat_f0(audio, rate=16000):
    frame_length = 16.0  # in ms
    to_pad = int(frame_length / 1000 * rate) // 2

    f0s = []
    for y in audio.cpu().numpy().astype(np.float64):
        y_pad = np.pad(y.squeeze(), (to_pad, to_pad), "constant", constant_values=0)
        signal = basic.SignalObj(y_pad, rate)
        pitch = pYAAPT.yaapt(signal, **{'frame_length': frame_length, 'frame_space': 4.0, 'nccf_thresh1': 0.25,
                                        'tda_frame_length': 25.0})
        f0s += [pitch.samp_values[None, None, :]]

    f0 = np.vstack(f0s)
    f0 = torch.from_numpy(f0).float()
    return f0

The model tends to throw warnings and eventually crash after some time. When debugging it, it seems related to feeding it with silent parts of the audio. If I amplify the almost silent audio and send it again to the model, it doesn't crash. Did this behavior happen to anyone else? I'd be grateful for suggestions on how to overcome this :)

Here is the error:

File "/private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py", line 411, in yaapt pitch.set_values(final_pitch, signal.size) File "/private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py", line 103, in set_values self.values = self.upsample(self.samp_values, file_size, 0, 0, File "/private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py", line 228, in upsample tot_interval = np.arange(int(up_interval[0]-(self.frame_jump/2)), IndexError: index 0 is out of bounds for axis 0 with size 0

And here are the warnings that comes before it:

phi[lag_min:lag_max] = formula_nume/np.sqrt(formula_denom) /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py:1001: RuntimeWarning: invalid value encountered in greater vec_back = (phi[lag_min+center:lag_max-center+1] > /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py:1003: RuntimeWarning: invalid value encountered in greater vec_forw = (phi[lag_min+center:lag_max-center+1] > /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py:1005: RuntimeWarning: invalid value encountered in greater above_thresh = phi[lag_min+center:lag_max-center+1] > merit_thresh1 /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3334: RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype, /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/numpy/core/_methods.py:216: RuntimeWarning: Degrees of freedom <= 0 for slice ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof, /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/numpy/core/_methods.py:185: RuntimeWarning: invalid value encountered in true_divide arrmean = um.true_divide( /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/numpy/core/_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/scipy/signal/signaltools.py:1396: UserWarning: kernel_size exceeds volume extent: the volume will be zero-padded. warnings.warn('kernel_size exceeds volume extent: the volume will be ' /private/home/orik/.conda/envs/speech_enhancement_fromscratch3_py38_pt16/lib/python3.8/site-packages/amfm_decompy/pYAAPT.py:627: RuntimeWarning: invalid value encountered in multiply

bjbschmitt commented 4 years ago

Hi @kwanUm, thank you for your feedback. It seems indeed something related to the treatment of pure silent signals. I will make some tests here to try to reproduce your bug and find a solution for it.

kwanUm commented 4 years ago

Thanks! Here's a tensor that I've sent to the module that have crashed it - https://filebin.net/ddzbx6hiirbm97yg/example_crash_yaapt.pt?t=i6c71w46

Basically calling the code I've shared with audio equals to the tensor I've shared:

def get_praat_f0(audio, rate=16000):
    frame_length = 16.0  # in ms
    to_pad = int(frame_length / 1000 * rate) // 2

    f0s = []
    for y in audio.cpu().numpy().astype(np.float64):
        y_pad = np.pad(y.squeeze(), (to_pad, to_pad), "constant", constant_values=0)
        signal = basic.SignalObj(y_pad, rate)
        pitch = pYAAPT.yaapt(signal, **{'frame_length': frame_length, 'frame_space': 4.0, 'nccf_thresh1': 0.25,
                                        'tda_frame_length': 25.0})
        f0s += [pitch.samp_values[None, None, :]]

    f0 = np.vstack(f0s)
    f0 = torch.from_numpy(f0).float()
    return f0

bjbschmitt commented 4 years ago

Ok, I was able to read your tensor and reproduce the bug. It seems that problem is due the fact that the spectral pitch standard deviation calculated here is equal 0, since all spectral pitch values from voiced frames are equal to 203.125 Hz.

This standard deviation equal to 0 ends up messing not only the frequency threshold, but probably also the spectral search range.

I checked the original Matlab code, and this issue was actually inherit from it. Probably this corner case was not foreseen by the authors. Anyway, I'll test some solutions, probably it will be something like "if standard deviation is lower than X use X as minimum value."

kwanUm commented 4 years ago

That sounds great! Thank you for looking into this!

bjbschmitt commented 4 years ago

Hi, the specific bug in the file that you have sent is fixed. I ended up using a percentage of the average spectral pitch as "standard deviation lower bound". In order to do this, I also introduced a new input parameter called "spec_pitch_min_std", which has default value of 0.05 (i.e., 5%). So, this means that, if the spectral pitch standard deviation is lower than 5% of the spectral pitch mean value, the algorithm uses 0.05*pitch_avg to calculate the freq_threshold.

Actually, in the original YAAPT paper the authors used the whole pitch_avg value to calculate the freq_threshold. But probably later they realized that this value was too big, since that the original code line referring this part in the Matlab tm_trk.m file is currently commented. Therefore, they might had decided to replace the average pitch by its standard deviation. But as I mentioned before, probably they forgot to take in account the cases where this standard deviation could be equal to zero.

Anyway, I uploaded the new AMFM_decompy 1.0.10 version to both PyPi and github repositories. The documentation has also been updated as well. Please upgrade your AMFM_decompy version through pip and check if the problem with this specific file has been really solved.

I believe that there might be a couple more of bugs to be fixed. I just have reread your first message and it seems that the warning messages there refer to something else. Thus, if possible, keep sending your "problematic files" to me, so that I can eliminate these bugs one by one.

kwanUm commented 4 years ago

Sounds good! I'll give 1.0.10 a test and let you know if I still see other warnings.

bjbschmitt commented 3 years ago

Since no other problem were reported, I'll consider that this issue was solved.

bjbschmitt / AMFM_decompy

Error while parsing silent parts of audio #10