Rayhane-mamah / Tacotron-2

A TensorFlow implementation of Google's Tacotron-2
MIT License
MIT License

Keeping the waveform steady in a long wave file concatenated from several synthesized clips #292

begeekmyfriend opened this issue 5 years ago

begeekmyfriend commented 5 years ago

For synthesis, we usually need to concatenate all the synthesized clips into one long wave file. One problem, however, is that it is hard to keep the volume consistent across the different clips.

I have a rough workaround: apply non-linear scaling to each clip so that weak volumes are amplified more and strong volumes less. The rough formulation is Y = sign(X) * |X|^k. When k > 1 this is called superlinear, and when k < 1 sublinear. In this case I applied the sublinear method to each synthesized wav clip.
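A minimal sketch of this power-law companding (the function name is mine, not from the repo): with k < 1, the effective gain (output over input) is larger for quiet samples than for loud ones, which is what evens out the loudness across clips.

```python
import numpy as np

def power_law_scale(wav, k=0.667):
    """Sublinear (k < 1) or superlinear (k > 1) amplitude scaling."""
    return np.sign(wav) * np.power(np.abs(wav), k)

quiet, loud = 0.1, 0.9
gain_quiet = power_law_scale(quiet) / quiet
gain_loud = power_law_scale(loud) / loud
print(gain_quiet > gain_loud)  # True: weak volumes get amplified more
```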

In audio.py

import numpy as np
from scipy.io import wavfile

def save_wav(wav, path, sr):
    # wav = wav * (32767 / max(0.01, np.max(np.abs(wav))))
    # Rescale to [-1, 1] to unify the measure across all synthesized clips
    wav = wav / np.abs(wav).max() * 0.999
    # Factor 0.5 guards against int16 overflow
    f1 = 0.5 * 32767 / max(0.01, np.max(np.abs(wav)))
    # Non-linear (sublinear) scaling Y ~ X^k (k = 0.667)
    f2 = np.sign(wav) * np.power(np.abs(wav), 0.667)
    wav = f1 * f2
    # int16 conversion proposed by @dsmiller
    wavfile.write(path, sr, wav.astype(np.int16))

Let us see the effect. The evaluation now sounds much steadier. However, when this scaling is applied, frequency content outside the fmin–fmax range gets scaled as well, which can introduce noise, so we also need to filter out the right frequency band. I am still working on that; any suggestion is welcome. The current evaluation is attached below. wangdantong_22050.zip

begeekmyfriend commented 5 years ago

Here is another evaluation result with a better corpus; it sounds very close to WaveNet, right? This dirty, brute-force algorithm gets close to a neural vocoder... ad_48000.zip

begeekmyfriend commented 5 years ago

Added a bandpass filter; there is less noise now. biaobei_xizang_48000.zip

import numpy as np
from scipy import signal
from scipy.io import wavfile

def save_wav(wav, path, hparams):
    wav = wav / np.abs(wav).max() * 0.999
    f1 = 0.5 * 32767 / max(0.01, np.max(np.abs(wav)))
    f2 = np.sign(wav) * np.power(np.abs(wav), 0.7)
    wav = f1 * f2
    # proposed by @dsmiller
    # FIR bandpass keeping only the fmin-fmax band
    wav = signal.convolve(wav, signal.firwin(hparams.num_freq, [hparams.fmin, hparams.fmax], pass_zero=False, fs=hparams.sample_rate))
    wavfile.write(path, hparams.sample_rate, wav.astype(np.int16))
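One way to sanity-check that such a firwin design really passes only the fmin–fmax band is to inspect its frequency response with scipy.signal.freqz. The concrete values below are illustrative stand-ins for the hparams, not necessarily the repo's defaults:

```python
import numpy as np
from scipy import signal

# Illustrative values standing in for hparams
num_freq, fmin, fmax, sample_rate = 513, 55, 7600, 22050

taps = signal.firwin(num_freq, [fmin, fmax], pass_zero=False, fs=sample_rate)
w, h = signal.freqz(taps, fs=sample_rate)  # w in Hz, h complex response
gain = np.abs(h)

# Gain should be near unity well inside the passband and near zero at DC
passband = gain[(w > 2 * fmin) & (w < 0.5 * fmax)]
print(passband.min() > 0.5, gain[0] < 0.1)
```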


superhg2012 commented 5 years ago

Any further progress on the clip-concatenation issue? @begeekmyfriend

begeekmyfriend commented 5 years ago

@superhg2012 Here is a batch synthesis implementation: https://github.com/begeekmyfriend/Tacotron-2/commit/f3bdae8ef26d51fb28b28d5e7413180f144401c1

superhg2012 commented 5 years ago

I am working on concatenating pre-recorded, high-quality sound clips. After using your code, the resulting wave is noisy and the quality is bad. My sample rate is 8 kHz; how should I adjust the parameters in your code? @begeekmyfriend

begeekmyfriend commented 5 years ago

    f2 = np.sign(wav) * np.power(np.abs(wav), 1.0)
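For context (my note, not the author's): with k = 1.0 the power-law step is the identity, so the non-linear companding is effectively disabled and only the linear rescaling remains. A quick check:

```python
import numpy as np

wav = np.linspace(-1.0, 1.0, 11)
scaled = np.sign(wav) * np.power(np.abs(wav), 1.0)
print(np.allclose(scaled, wav))  # True: k = 1 leaves the waveform unchanged
```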
superhg2012 commented 5 years ago

My process is as below:

step 1: concatenate two high-quality source recordings

step 2: adjust the concatenated sound with your method

The resulting sound is still not clear. What should I do? @begeekmyfriend

    def concatenate(wav1, wav2):
        total_len = len(wav1) + len(wav2)
        res_wav = np.zeros(total_len)
        res_wav[:len(wav1)] = wav1
        res_wav[len(wav1):] = wav2
        return res_wav
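A quick usage check of the concatenate helper (the sample arrays are mine):

```python
import numpy as np

def concatenate(wav1, wav2):
    total_len = len(wav1) + len(wav2)
    res_wav = np.zeros(total_len)
    res_wav[:len(wav1)] = wav1
    res_wav[len(wav1):] = wav2
    return res_wav

a = np.array([0.1, 0.2])
b = np.array([0.3, 0.4, 0.5])
out = concatenate(a, b)
print(out)  # [0.1 0.2 0.3 0.4 0.5]
```

This is equivalent to np.concatenate((wav1, wav2)); note that a hard join like this can still leave an audible click at the clip boundary.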

    wav = wav / np.abs(wav).max() * 0.999
    f1 = 0.5 * 32767 / max(0.01, np.max(np.abs(wav)))
    f2 = np.sign(wav) * np.power(np.abs(wav), 1.0)
    wav = f1 * f2
    # proposed by @dsmiller
    # fs is the sample rate (8000 Hz here); the band edges must stay below Nyquist
    wav = signal.convolve(wav, signal.firwin(513, [60, 3999], pass_zero=False, fs=fs))
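One detail worth noting at an 8 kHz sample rate: firwin requires cutoff frequencies strictly between 0 and fs/2, which is presumably why the upper band edge above is 3999 Hz rather than the Nyquist frequency of 4000 Hz:

```python
from scipy import signal

fs = 8000  # 8 kHz sample rate from the discussion above

# Valid: 3999 Hz is just below Nyquist (fs / 2 = 4000 Hz)
taps = signal.firwin(513, [60, 3999], pass_zero=False, fs=fs)
print(len(taps))  # 513

# Invalid: a cutoff at or above fs / 2 raises ValueError
try:
    signal.firwin(513, [60, 4000], pass_zero=False, fs=fs)
except ValueError as e:
    print("rejected:", e)
```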