auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

What are the full preprocessing steps? #14

Open MostafaOmar98 opened 5 years ago

MostafaOmar98 commented 5 years ago

I'm retraining the model on my own data, but the output is all noise. I suspect the problem is in how I generate the mel-spectrograms. I generate them with librosa and also use librosa to invert the model's output back to raw audio.

Here are the functions I use to convert between raw audio and mel-spectrograms:

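import librosa
import numpy as np

# hp is the object holding the hyperparameters listed below.

def normalize(S):
    # Map dB values in [hp.min_level_db, 0] to [0, 1], clipping anything outside.
    return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)

def denormalize(S):
    # Inverse of normalize: map [0, 1] back to [hp.min_level_db, 0] dB.
    return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db

def amp_to_db(x):
    # Amplitude to decibels, with amplitudes floored at 1e-5 (-100 dB).
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    # Decibels back to amplitude.
    return np.power(10.0, x * 0.05)

def melspectrogram(y):
    # Raw audio -> mel-spectrogram -> dB -> normalized to [0, 1].
    S = librosa.feature.melspectrogram(
        y=y, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length,
        n_mels=hp.n_mels, fmin=hp.fmin, fmax=hp.fmax, power=hp.power,
    )
    S = amp_to_db(S)
    S = normalize(S)
    return S

def inverse_melspectrogram(M):
    # Normalized mel-spectrogram -> dB -> amplitude -> raw audio.
    M = denormalize(M)
    M = db_to_amp(M)
    y = librosa.feature.inverse.mel_to_audio(
        M=M, sr=hp.sr, n_fft=hp.fft_size, hop_length=hp.hop_length,
        power=hp.power,
    )
    return y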

Here are the hyperparameters I'm using:

sr = 16000
n_mels = 80
fmin = 90
fmax = 7600
fft_size = 1024
hop_length = 256
min_level_db = -100
ref_level_db = 20
PAD_VALUE = -100000
BATCH_SIZE = 32
MAX_FRAMES = 1024
power = 1.0

Could you tell me if there is an issue with my preprocessing steps? If you need any more info, please ask.

Thanks
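
One way to narrow this down is to test the dB conversion and normalization on their own, without librosa or the model. The following is a minimal NumPy-only sketch under stated assumptions: it copies the four helper functions from the post, hard-codes min_level_db = -100 as listed in the hyperparameters, and uses a hypothetical `roundtrip` helper and made-up test amplitudes that are not part of the original code.

```python
import numpy as np

MIN_LEVEL_DB = -100  # same value as min_level_db in the post

def normalize(S):
    return np.clip((S - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0, 1)

def denormalize(S):
    return (np.clip(S, 0, 1) * -MIN_LEVEL_DB) + MIN_LEVEL_DB

def amp_to_db(x):
    return 20 * np.log10(np.maximum(1e-5, x))

def db_to_amp(x):
    return np.power(10.0, x * 0.05)

def roundtrip(amplitudes):
    # Hypothetical helper: forward chain followed by the inverse chain.
    return db_to_amp(denormalize(normalize(amp_to_db(amplitudes))))

# Made-up amplitudes: one below the 1e-5 floor, four inside the
# representable range, and one above 1.0 (that is, above 0 dB).
amplitudes = np.array([1e-7, 1e-5, 1e-3, 0.5, 1.0, 4.0])
recovered = roundtrip(amplitudes)

for a, r in zip(amplitudes, recovered):
    print(f"{a:10.7f} -> {r:10.7f}")
```

With these definitions, amplitudes between 1e-5 and 1.0 survive the round trip, amplitudes below 1e-5 come back as 1e-5, and amplitudes above 1.0 come back as 1.0. Any mel magnitudes outside that range are therefore clipped before they reach the model, so it may be worth checking the range of the raw librosa output before normalizing.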

auspicious3000 commented 5 years ago

What do your input and output spectrograms look like?
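
If posting plots is inconvenient, a numeric summary can be a rough substitute. The sketch below is only an illustration: `describe_spectrogram` is a hypothetical helper, not part of this repository, and it assumes the spectrograms are NumPy arrays normalized to [0, 1] as in the functions above. The example array stands in for real data.

```python
import numpy as np

def describe_spectrogram(S, name):
    """Summarize a normalized spectrogram as a small dict of statistics."""
    S = np.asarray(S, dtype=np.float64)
    return {
        "name": name,
        "shape": S.shape,
        "min": float(S.min()),
        "max": float(S.max()),
        "mean": float(S.mean()),
        # Fraction of bins sitting at the clipping limits of [0, 1].
        "frac_at_floor": float(np.mean(S <= 0.0)),
        "frac_at_ceiling": float(np.mean(S >= 1.0)),
        "has_nan_or_inf": bool(not np.isfinite(S).all()),
    }

# Placeholder data standing in for a real (n_mels, frames) spectrogram.
example = np.array([[0.0, 0.5], [1.0, 0.25]])
stats = describe_spectrogram(example, "input")
for key, value in stats.items():
    print(f"{key}: {value}")
```

Running it on both the model input and the model output gives two summaries that can be compared side by side; a large fraction of bins at the floor or ceiling would suggest that clipping is discarding information.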