bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"
https://bshall.github.io/UniversalVocoding/
MIT License

preprocessing_mel question #18

Closed Kerry0123 closed 4 years ago

Kerry0123 commented 4 years ago

Hi, I have a doubt about the preprocessing_mel function. I use the following preprocessing method, but the generated audio file is silent.

```python
def melspectrogram(wav, hparams):
    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
    # _linear_to_mel computes np.dot(_mel_basis, spectogram)
    S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
    if hparams.signal_normalization:
        return _normalize(S, hparams)
    return S

def _stft(y, hparams):
    if hparams.use_lws:  # use_lws is False in my config
        return _lws_processor(hparams).stft(y).T
    else:
        # i.e. librosa.stft(y, n_fft=num_fft, hop_length=hop_length, win_length=win_length)
        return librosa.stft(y=y, n_fft=hparams.n_fft,
                            hop_length=get_hop_size(hparams),
                            win_length=hparams.win_size)

def _linear_to_mel(spectogram, hparams):
    global _mel_basis
    if _mel_basis is None:
        _mel_basis = _build_mel_basis(hparams)
    return np.dot(_mel_basis, spectogram)

def _amp_to_db(x, hparams):
    # with min_level_db = -100: np.exp(-100 / 20 * np.log(10)) == 10 ** (-100 / 20)
    min_level = np.exp(hparams.min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))

def _normalize(S, hparams):
    if hparams.allow_clipping_in_normalization:  # True in my config
        if hparams.symmetric_mels:  # True in my config
            return np.clip(
                (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,
                -hparams.max_abs_value, hparams.max_abs_value)
        else:
            return np.clip(
                hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)),
                0, hparams.max_abs_value)
```

The main differences are the line `S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db` and the `_normalize` step. With hparams.ref_level_db = 20 and hparams.max_abs_value = 4, my data lies in [-4, 4], while your preprocessing produces data in [0, 1]. Does the data range have a big influence on the model? I don't understand this and am asking for your help. Thank you.
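A minimal sketch of the rescaling in question: mapping a symmetrically normalized [-4, 4] mel to the [0, 1] range this repo's preprocessing produces. The helper name `symmetric_to_unit` and the `max_abs_value=4.0` default are assumptions taken from the hparams above, not code from either repo.

```python
import numpy as np

def symmetric_to_unit(S, max_abs_value=4.0):
    """Map a mel normalized to [-max_abs_value, max_abs_value] into [0, 1].
    Hypothetical helper; assumes symmetric_mels=True was used upstream."""
    return np.clip((S + max_abs_value) / (2 * max_abs_value), 0.0, 1.0)

# usage: mel01 = symmetric_to_unit(np.load("example_mel.npy"))
```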

Kerry0123 commented 4 years ago

I am asking for your help. Thank you.

bshall commented 4 years ago

Hi @Kerry0123,

Did you retrain the model with your preprocessing steps or did you feed your spectrograms directly to the pretrained model?

Kerry0123 commented 4 years ago

I retrained the model with my preprocessing steps. The loss at epoch 1 is 0.66, and it then drops toward 0. I am asking for your help. Thank you.

bshall commented 4 years ago

@Kerry0123, something weird is going on because that loss is very low. What dataset are you using? The ZeroSpeech one? Also, could you share an example spectrogram so I can check if anything is odd?
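A quick way to inspect a saved mel before sending it (a sketch; the file name is a placeholder, and the [0, 1] expectation assumes this repo's preprocessing):

```python
import numpy as np

mel = np.load("example_mel.npy")  # placeholder path
print("shape:", mel.shape)        # e.g. (num_mels, frames) or (frames, num_mels)
print("min/max:", mel.min(), mel.max())  # this repo's preprocessing yields roughly [0, 1]
```

For context on why the reported loss looks suspicious: if the vocoder is trained with, say, a 256-class mu-law cross-entropy objective, a uniform predictor starts near ln(256) ≈ 5.5 nats, so 0.66 at epoch 1 would suggest the targets are nearly constant, e.g. silence.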

Kerry0123 commented 4 years ago

The dataset is BZNSYP (a Chinese dataset). To align the output of the synthesizer with the input of the vocoder, I use the preprocessing of the tacotron2 synthesizer. Its GitHub link: https://github.com/cnlinxi/style-token_tacotron2. The preprocessing command is: python preprocess.py --dataset=biaobei --base_dir=/tmp-data/data/ --output=/nfs/volume-340-1/tts_data_preprocess/training_data_biaobe. Would it be convenient to tell me your email address? I will send you the mel file. I am asking for your help. Thank you.

bshall commented 4 years ago

Sure, you can send it to benjamin.l.van.niekerk@gmail.com

Just to check, did you keep all the other preprocessing the same, e.g. the mu-law encoding and all the padding here?
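For reference, mu-law encoding in its conventional form looks like the sketch below. This is my own sketch, not this repo's code, and `bits=8` is an assumed value; check the repo's hparams for the actual bit depth.

```python
import numpy as np

def mulaw_encode(x, bits=8):
    """Mu-law compand audio in [-1, 1] and quantize to 2**bits integer classes."""
    mu = 2 ** bits - 1
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)   # quantize to [0, mu]

# usage: labels = mulaw_encode(wav)  # wav is float audio in [-1, 1]
```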