auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Tools to split VCTK audio #39

Closed. YichongLeng closed this issue 4 years ago.

YichongLeng commented 4 years ago

From your demo, it seems that some tool was used to split the original VCTK audio. Could you please share the tool?

DatanIMU commented 4 years ago

Do you mean VAD?

YichongLeng commented 4 years ago

In my understanding, VAD can split audio at silences. But how can we also split the transcript while keeping the audio segments and the transcript aligned? I wonder what tools were used.
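
For concreteness, here is roughly the VAD-style splitting I have in mind, as a minimal librosa sketch (the 16 kHz rate and the `top_db` threshold are just guesses, not values from this repo):

```python
import librosa
import soundfile as sf

# Load one VCTK utterance; 16 kHz is an assumed working rate,
# not the corpus-native 48 kHz.
wav, sr = librosa.load("p225_001.wav", sr=16000)

# Find non-silent intervals. top_db sets how many dB below peak
# counts as silence; 30 is a guess and needs tuning per corpus.
intervals = librosa.effects.split(wav, top_db=30)

# Write each non-silent stretch out as its own clip.
for i, (start, end) in enumerate(intervals):
    sf.write(f"p225_001_seg{i:02d}.wav", wav[start:end], sr)
```

This splits the audio, but it says nothing about how to cut the transcript at the matching points, which is the part I am asking about.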

auspicious3000 commented 4 years ago

Human

YichongLeng commented 4 years ago

Got it~ Thanks for your reply~

ma1112 commented 4 years ago

To my understanding, AutoVC is a non-parallel training method and hence needs no transcript to train. I wonder why you are concerned about using / splitting the transcript at all.

DatanIMU commented 4 years ago

Agreed, only TTS needs word alignments.

syyuan1993 commented 4 years ago

@auspicious3000 Thanks for the great work and code. I tried to run the code you sent through email on 20 speakers of VCTK data, and the model did not converge. My primary guess was that preprocessing is the issue, so I tried deleting the silence at the start and end of the audio files, but that doesn't really help. What did you do for data preprocessing?
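
For reference, the trimming I tried was roughly the following (a minimal librosa sketch; the sample rate and `top_db` threshold are my own choices, not taken from your code):

```python
import librosa
import soundfile as sf

wav, sr = librosa.load("p225_001.wav", sr=16000)

# Trim leading and trailing silence only; top_db=20 is the
# threshold I assumed, anything quieter than 20 dB below peak.
trimmed, _ = librosa.effects.trim(wav, top_db=20)

sf.write("p225_001_trimmed.wav", trimmed, sr)
```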

Thanks

auspicious3000 commented 4 years ago

@syyuan1993 You can find all the details in the data preprocessing code.

syyuan1993 commented 4 years ago

@auspicious3000 Sure. But the data samples you provided are different from the ones in the VCTK dataset. For example, the silence at the start and end of all the audio files has been removed. How did you remove the silent parts? Did you do any other preprocessing to the .wav data? Thanks a lot!
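
To make the question concrete: is the preprocessing essentially just a high-pass filter plus log-mel extraction, along these lines? Every parameter value in this sketch is my guess, not taken from your code:

```python
import numpy as np
import soundfile as sf
from scipy import signal
from librosa.filters import mel

def butter_highpass(cutoff, fs, order=5):
    # Remove low-frequency rumble before the spectrogram.
    return signal.butter(order, cutoff / (0.5 * fs), btype="high")

wav, fs = sf.read("p225_001.wav")          # assumed 16 kHz mono
b, a = butter_highpass(30, fs)
wav = signal.filtfilt(b, a, wav)

# Magnitude STFT -> 80-band mel -> log scale, clipped at an
# assumed -100 dB floor and shifted/scaled toward [0, 1].
_, _, stft = signal.stft(wav, fs, nperseg=1024, noverlap=1024 - 256)
mel_basis = mel(sr=fs, n_fft=1024, fmin=90, fmax=7600, n_mels=80)
mel_spec = np.dot(mel_basis, np.abs(stft))
min_level = np.exp(-100 / 20 * np.log(10))
log_mel = (20 * np.log10(np.maximum(min_level, mel_spec)) + 100) / 100

np.save("p225_001.npy", log_mel.T.astype(np.float32))
```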

auspicious3000 commented 4 years ago

@syyuan1993 That should not matter. No special preprocessing is required.

syyuan1993 commented 4 years ago

@auspicious3000 Thanks for the information. But in my experiments, the model doesn't converge once I use more audio data, for example 20 speakers with 200 utterances per speaker. Any suggestions?