spagliarini opened this issue 4 years ago
The raw recordings I used were up to several minutes long. I used a spectral energy heuristic to extract a few 1.5-second chunks from each raw recording which had the most energy. I used these as the training examples (with the random shuffle flag enabled so that there would be a bit of phase jitter each time they were presented to the GAN).
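The exact heuristic isn't shown in the thread, but the idea above can be sketched as follows. This is a minimal, assumed implementation: slide a 1.5-second window over the signal, rank windows by total energy (by Parseval's theorem, time-domain energy equals spectral energy), and greedily keep the top non-overlapping ones. The function name, hop size, and sample rate are my choices, not from the original.

```python
import numpy as np

def top_energy_chunks(audio, sr=16000, chunk_s=1.5, n_chunks=3, hop_s=0.25):
    """Return the n_chunks highest-energy, non-overlapping chunks of `audio`.

    Slides a chunk_s-second window over the signal in hop_s-second steps,
    ranks candidate windows by total energy (sum of squared samples),
    and greedily keeps the top-ranked non-overlapping ones.
    """
    chunk = int(chunk_s * sr)
    hop = int(hop_s * sr)
    starts = range(0, max(1, len(audio) - chunk + 1), hop)
    # Sort candidate start positions by descending window energy.
    scored = sorted(starts, key=lambda s: -np.sum(audio[s:s + chunk] ** 2))
    picked = []
    for s in scored:
        # Keep a window only if it doesn't overlap an already-picked one.
        if all(abs(s - p) >= chunk for p in picked):
            picked.append(s)
        if len(picked) == n_chunks:
            break
    return [audio[s:s + chunk] for s in sorted(picked)]
```

The greedy non-overlap check keeps the extractor from returning several near-duplicate windows centered on the same loud event.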
Until now, I have tried to train WaveGAN using two different datasets:
Single syllables, i.e. I selected the syllables one by one and then padded them with silence up to 1 second (or 250 milliseconds): I had some promising results, but I ran into the "loss value problem" I described in #64.
One-second chunks of the recordings, using an adaptive sliding window that generates overlapping chunks in the dataset (I built them by selecting a syllable and then taking the chunks starting from it). In this case, #64 is no longer an issue, so I suspect that the chunk length matters. Have you ever trained WaveGAN on very short data (shorter than SC09)?
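The second dataset construction above could look something like this. It is only a sketch under my own assumptions (function name, stride, and sample rate are hypothetical): for each syllable onset, cut one-second windows that start at the onset and then slide forward in small strides, so neighbouring windows overlap and each syllable appears in the dataset several times with different alignments.

```python
import numpy as np

def overlapping_chunks(audio, onsets, sr=16000, chunk_s=1.0, stride_s=0.25):
    """Cut overlapping chunk_s-second windows anchored at syllable onsets.

    For every onset (sample index), emit a window starting at the onset,
    then slide forward by stride_s and emit again, stopping once the
    window no longer contains the onset.
    """
    chunk = int(chunk_s * sr)
    stride = int(stride_s * sr)
    out = []
    for onset in onsets:
        s = onset
        while s + chunk <= len(audio):
            out.append(audio[s:s + chunk])
            s += stride
            if s - onset >= chunk:  # window has slid past the onset
                break
    return out
```

The overlap acts as a cheap data-augmentation step: the GAN sees each syllable at several temporal offsets, which may explain the more stable training compared to the silence-padded single-syllable dataset.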
I have not tried training WaveGAN on slices shorter than 16384 samples. For me, one-second clips were already verging on unsatisfyingly short. Good to hear that it results in more stable training, however!
Hi,
For the dataset on birds, I read that the total length of the dataset is 12.2 hours, but I'm interested in the characteristics of each single recording used for the training. Do they have a duration of 1 second, like the speech dataset? Or are they pre-processed differently?
Thank you for your availability!