chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks
MIT License

Results on 'continuous' speech (recommendations needed too) #76

Open jvel07 opened 4 years ago

jvel07 commented 4 years ago

Hi, I just want to share the results obtained so far when training WaveGAN on 'continuous' speech. Description of the data: the dataset consists of 1000 wavs ranging from 2 to 10 seconds long, from different speakers. In each wav, one speaker says one arbitrary short phrase (in German). I trained the model with data_slice_len=32768 and wavegan_dim=32, for 89000 iterations. The results look promising, but they still need improvement: there is some noise and the voice is not yet clear (a bit 'robotized'). @chrisdonahue, what would be your suggestion in this case?
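(For reference, a quick sketch of what that slice length means in seconds, using the values above; since the clips are 2-10 s, each one spans roughly one to five training slices:)

```python
# Duration of one WaveGAN training slice (values taken from the settings above).
slice_len = 32768      # samples per slice (data_slice_len)
sample_rate = 16000    # Hz (data_sample_rate)

duration_s = slice_len / sample_rate
print(duration_s)  # 2.048 seconds per slice
```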

Here is where you can listen to the generated audio: https://soundcloud.com/jvel07/sets/wave-gans-generated-speech

jvel07 commented 4 years ago

After nearly 200k iterations I still couldn't get rid of the 'robotized' voice. Does anybody have suggestions, please? :) @chrisdonahue @andimarafioti

andimarafioti commented 4 years ago

What sampling rate are you using? Are you training and generating on different lengths (2-10 secs), or is the length fixed? To what? Your dataset sounds more complex than the one used in the WaveGAN experiments, so your results are not completely unexpected.

jvel07 commented 4 years ago

Thanks, @andimarafioti, for your answer. Indeed, I am aware that my dataset differs from the ones used in your experiments in that I input variable-length recordings (I read that fixed lengths are preferable, but I wanted to try it out anyway). The sample rate is 16k.

spagliarini commented 4 years ago

Which options are you using for the data? E.g. --data_first_slice, --data_pad_end, and so on.

jvel07 commented 4 years ago

Hi, @spagliarini. Those are as follows:

data_fast_wav: True
data_first_slice: False
data_normalize: False
data_num_channels: 1
data_overlap_ratio: 0.0
data_pad_end: False
data_prefetch_gpu_num: 0
data_sample_rate: 16000
data_slice_len: 32768
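(Those settings would correspond to a training invocation roughly like the following; this is a sketch assuming the `train_wavegan.py` entry point and flag names from the repo's README, with placeholder paths, and with the boolean flags that are False simply omitted:)

```shell
python train_wavegan.py train ./train \
  --data_dir ./data \
  --data_fast_wav \
  --data_sample_rate 16000 \
  --data_slice_len 32768 \
  --wavegan_dim 32
```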

andimarafioti commented 4 years ago

I'm sorry @jvel07, I can't really help you much further with this since I'm not so familiar with the WaveGAN code and results. If you want to try our project, which tackles the same problem as WaveGAN but uses a different representation of sound, I could help further (https://github.com/tifgan/stftGAN). For what it's worth, I would try smaller slices, maybe half of what you have.

jvel07 commented 4 years ago

@andimarafioti I understand, thanks anyway for your suggestions. Let's wait for this to reach @chrisdonahue. Regarding your repo, I haven't used TF representations before. Can they be used to extract, e.g., MFCCs?

andimarafioti commented 4 years ago

Actually, MFCC is itself a TF representation. TF just stands for time-frequency: the representation has two dimensions, one for time and one for frequency.
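(To make the "two dimensions" concrete, here is a minimal NumPy-only sketch of a magnitude spectrogram, the most basic TF representation; the window and hop sizes are arbitrary choices for illustration. MFCCs are then derived from such a spectrogram by applying a mel filterbank, a log, and a DCT:)

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude spectrogram: a basic time-frequency (TF) representation."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    # Rows index frequency bins, columns index time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone
S = stft_mag(x)
print(S.shape)  # (129, 124): 129 frequency bins x 124 time frames
```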

chrisdonahue commented 4 years ago

@jvel07 the samples you linked are not too far from what I've heard from training WaveGAN on more complex speech datasets (see our paper results on the TIMIT dataset).

One thing that might improve things is increasing the model dimensionality from 32 to 64 (or larger).

You could try adding a post-processing filter with --wavegan_genr_pp which might help with the noise. You might also consider training for longer (I usually trained for 200k iterations or so).

The data loader settings you linked seem fine.
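(As I understand the post-processing option, it applies a learned length-512 FIR filter, i.e. a 1-D convolution, to the generator output. Here is a NumPy sketch of just that operation, using a fixed moving-average low-pass kernel as a stand-in for the learned taps:)

```python
import numpy as np

pp_len = 512
kernel = np.ones(pp_len) / pp_len      # stand-in low-pass taps; in WaveGAN these are learned

rng = np.random.default_rng(0)
audio = rng.standard_normal(16384)     # stand-in for noisy generator output
filtered = np.convolve(audio, kernel, mode="same")

print(filtered.shape == audio.shape)   # True: output length matches the input
print(filtered.std() < audio.std())    # True: broadband noise is attenuated
```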

jvel07 commented 4 years ago

Thanks, @chrisdonahue. I will try that out. Regarding the length of that filter, may I keep it at 512?