auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License
976 stars 207 forks

len_crop issue when train with VCTK dataset #107

Open junseokoh1 opened 2 years ago

junseokoh1 commented 2 years ago

Thanks for your nice project and for letting us use it.

I guess the pre-trained model this project provides was trained with only 4 speakers held out for validation (p225, p226, p227, p228).

So to reproduce the results of the paper, I downloaded the VCTK dataset and, as the paper says, picked 40 speakers for zero-shot training, which is p225 ~ p269 (maybe 40 ~ 43 speakers; I also changed the sampling rate to 16 kHz).

When making train.pkl with make_metadata.py, there is an error related to len_crop.

ValueError: 'a' cannot be empty unless no samples are taken

The default value of len_crop is 128, but some utterances in VCTK are shorter than that.

I want to know whether you just removed the len_crop part, or only used the VCTK utterances that are longer than 128 frames.
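For context, this ValueError is what numpy's random sampling raises when the pool of candidate crop positions is empty, which happens whenever an utterance has fewer than len_crop frames. A minimal sketch (assuming the crop start is drawn with something like `np.random.choice`, as the traceback message suggests; the array shapes here are made up):

```python
import numpy as np

len_crop = 128
tmp = np.zeros((90, 80))  # a hypothetical mel spectrogram shorter than len_crop

# The pool of valid crop start positions is empty for short utterances:
starts = np.arange(tmp.shape[0] - len_crop)  # arange of a negative number -> empty
try:
    left = np.random.choice(starts)
except ValueError as e:
    print(e)  # the error quoted in this issue
```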

Thanks for reading this issue.

junseokoh1 commented 2 years ago

Also, there are about 400 utterances for each speaker.

In the make_metadata.py code, to use all the data, I replaced https://github.com/auspicious3000/autovc/blob/79dda70cff8e4e15e634f64dd7364c6a090b799b/make_metadata.py#L42-L47 with `melsp = torch.from_numpy(tmp[np.newaxis, :, :]).cuda()`
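In other words, instead of sampling a fixed 128-frame window, the whole utterance just gets a batch axis, so the speaker encoder sees a variable-length input with batch size 1. A minimal numpy-only sketch (with `tmp` as a hypothetical (frames, n_mels) mel spectrogram; the `torch.from_numpy(...).cuda()` transfer is left out):

```python
import numpy as np

def full_utterance(tmp):
    # Keep the whole mel spectrogram instead of a random len_crop crop;
    # tmp has shape (frames, n_mels), the result (1, frames, n_mels),
    # ready for torch.from_numpy(...) as in the replacement above.
    return tmp[np.newaxis, :, :]

mel = np.zeros((90, 80), dtype=np.float32)  # shorter than the default len_crop
print(full_utterance(mel).shape)  # (1, 90, 80)
```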

auspicious3000 commented 2 years ago

You could just skip the shorter utterances.
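A minimal sketch of that suggestion (the helper name is hypothetical; the crop logic follows the len_crop=128 default described above):

```python
import numpy as np

def crop_or_skip(mel, len_crop=128):
    """Return a random len_crop-frame crop of mel (shape: frames x n_mels),
    or None when the utterance is too short and should be skipped."""
    if mel.shape[0] < len_crop:
        return None  # shorter than len_crop -> skip this utterance
    left = np.random.randint(0, mel.shape[0] - len_crop + 1)
    return mel[left:left + len_crop, :]

# Usage in the metadata loop: only keep utterances long enough to crop.
mels = [np.zeros((90, 80)), np.zeros((200, 80))]
kept = [c for c in (crop_or_skip(m) for m in mels) if c is not None]
print(len(kept), kept[0].shape)  # 1 (128, 80)
```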