Open tobefans opened 2 years ago
There are two major preprocessing steps that the authors used in the original paper:
> To this end, we propose to perturb the information included in input waveform x by using three functions that are 1. formant shifting (fs), 2. pitch randomization (pr), and 3. random frequency shaping using a parametric equalizer (peq)

> The speakers of train-clean-360 were included to the training set only when the total length of speech samples exceeds 15 minutes.
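The speaker-selection rule in the second quote can be sketched as a simple duration filter. This is an illustrative sketch, not code from the repo: the `durations` mapping (speaker ID to per-utterance lengths in seconds) is a hypothetical input that you would build from the train-clean-360 metadata.

```python
def filter_speakers(durations, min_total_sec=15 * 60):
    """Keep only speakers whose total speech length exceeds min_total_sec.

    durations: dict mapping speaker ID -> list of utterance lengths (seconds).
    """
    return {
        spk: utts
        for spk, utts in durations.items()
        if sum(utts) > min_total_sec
    }

# Toy example: speaker "14" has 16 minutes of audio, speaker "19" only 10.
durations = {"14": [480.0, 480.0], "19": [600.0]}
kept = filter_speakers(durations)
```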
For process 2, I haven't done anything like that, so such filtering might help. For process 1, which is where the warning "PraatWarning: There were no voiced segments found." comes from, the problem is more complex. During that process (with my implementation), many different Praat and Parselmouth errors popped up, and I couldn't pin down the exact causes. For example, some wav files that clearly contained human voice still threw "PraatWarning: There were no voiced segments found." during perturbation :( I ignored the warning and trained anyway, but it might help if you remove the audio files that trigger those warnings.
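The suggestion above (skip files that raise the warning) can be sketched with Python's `warnings.catch_warnings`. The `perturb` callable here is a hypothetical stand-in for the real parselmouth-based perturbation pipeline; only the skip logic is shown.

```python
import warnings

def perturb_or_skip(path, perturb):
    """Run the perturbation; return None (i.e. skip the file) if any
    warning mentioning 'no voiced segments' was raised along the way."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        out = perturb(path)
    if any("no voiced segments" in str(w.message).lower() for w in caught):
        return None
    return out

# Dummy stand-in for the real perturbation, for demonstration only:
def fake_perturb(path):
    if path == "silence.wav":
        warnings.warn("There were no voiced segments found.")
    return path + ".perturbed"

result_ok = perturb_or_skip("speech.wav", fake_perturb)    # kept
result_bad = perturb_or_skip("silence.wav", fake_perturb)  # skipped
```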
I ran the code once using VCTK, but the conversion didn't work well. Is any data preprocessing needed, such as VAD? I often see the warning: "PraatWarning: There were no voiced segments found."
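If you want to screen files before training, a minimal energy-based check (not the repo's method; the frame length and threshold are illustrative) can flag clips that contain almost no signal and are therefore likely to trigger the warning:

```python
import math

def voiced_ratio(samples, frame_len=400, threshold=0.01):
    """Fraction of non-overlapping frames whose RMS energy exceeds `threshold`.

    samples: list of float samples in [-1, 1]. A ratio near 0 suggests the
    clip is (near-)silent and may trip 'no voiced segments' in Praat.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return 0.0
    def rms(frame):
        return math.sqrt(sum(x * x for x in frame) / len(frame))
    return sum(rms(f) > threshold for f in frames) / len(frames)

# Toy signals: an all-zero clip vs. one with a loud middle section.
silence = [0.0] * 1600
speech = ([0.0] * 400
          + [0.5 * math.sin(0.1 * i) for i in range(800)]
          + [0.0] * 400)
```

Note this only measures energy, not voicing; a real check would use pitch tracking (e.g. via parselmouth) or a proper VAD.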