dhchoi99 / NANSY


Question about the effect of vc #10

Open tobefans opened 2 years ago

tobefans commented 2 years ago

I ran the code once on VCTK, but the conversion didn't work well. Is any data preprocessing needed, such as VAD? I often see the warning: "PraatWarning: There were no voiced segments found."

dhchoi99 commented 2 years ago

There are two major preprocessing steps that the authors used in the original paper:

  1. Information perturbation using Parselmouth

    To this end, we propose to perturb the information included in input waveform x by using three functions that are 1. formant shifting (fs), 2. pitch randomization (pr), and 3. random frequency shaping using a parametric equalizer (peq)

  2. Dataset filtering

    The speakers of train-clean-360 were included to the training set only when the total length of speech samples exceeds 15 minutes.
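A minimal sketch of that speaker-level filter, not taken from this repo. It assumes you already have a `(speaker_id, duration_seconds)` pair for every utterance; the function name and the input format are hypothetical, and only the 15-minute threshold comes from the quote above.

```python
from collections import defaultdict

# Threshold quoted from the paper: total speech per speaker must exceed 15 minutes.
MIN_TOTAL_SECONDS = 15 * 60


def speakers_with_enough_audio(utterances, min_total_seconds=MIN_TOTAL_SECONDS):
    """Return the set of speaker ids whose summed duration exceeds the threshold.

    `utterances` is an iterable of (speaker_id, duration_seconds) pairs
    (a hypothetical input format; adapt it to however you index your dataset).
    """
    totals = defaultdict(float)
    for speaker_id, duration in utterances:
        totals[speaker_id] += duration
    # "Exceeds" is read as strictly greater than the threshold.
    return {spk for spk, total in totals.items() if total > min_total_seconds}
```

You would then keep only the utterances whose speaker id is in the returned set when building the training list.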

I haven't implemented process 2, so that filtering might help. Process 1, which is where the warning "PraatWarning: There were no voiced segments found." comes from, is more complicated. During perturbation (with my implementation), many different Praat and Parselmouth errors popped up, and I couldn't pin down the exact causes. For example, some wav files that definitely contained human voice still threw "PraatWarning: There were no voiced segments found." during perturbation :( So I ignored the warnings and trained anyway, but it might help to remove the audio files that trigger them.
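If you want to try removing those files, here is a rough sketch of one way to find them before training. It assumes the Praat warning reaches Python through the standard `warnings` machinery (the `PraatWarning:` prefix in the message suggests so, but I haven't verified every case). `perturb` is a placeholder for whatever perturbation function you call per file; it is not a function from this repo.

```python
import warnings


def split_by_warnings(paths, perturb):
    """Run `perturb(path)` on every path and separate clean files from flagged ones.

    `perturb` is a hypothetical callable standing in for your perturbation step.
    A file is flagged if the call emits any Python warning or raises an exception.
    Returns (clean_paths, flagged), where flagged is a list of (path, reason).
    """
    clean, flagged = [], []
    for path in paths:
        with warnings.catch_warnings(record=True) as caught:
            # Record every warning, even ones Python would normally show only once.
            warnings.simplefilter("always")
            try:
                perturb(path)
            except Exception as exc:
                flagged.append((path, repr(exc)))
                continue
        if caught:
            flagged.append((path, str(caught[0].message)))
        else:
            clean.append(path)
    return clean, flagged
```

Since the perturbation is random, a file may pass on one run and warn on another, so running this a few times and dropping anything flagged at least once is probably the safer option.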