I randomly selected 172 files from the 28- and 56-speaker training sets to form the validation set (a sketch of this split follows the file list below).
Here are the selected files for the validation set. p282_239.wav p314_058.wav p339_266.wav p323_362.wav p275_161.wav p299_340.wav p326_324.wav p283_156.wav p301_222.wav p233_017.wav p267_060.wav p301_041.wav p258_383.wav p241_157.wav p276_431.wav p244_006.wav p306_237.wav p326_227.wav p231_367.wav p256_320.wav p247_229.wav p284_004.wav p266_319.wav p272_249.wav p249_235.wav p245_334.wav p364_223.wav p335_243.wav p333_380.wav p263_346.wav p282_330.wav p376_173.wav p303_071.wav p376_052.wav p310_279.wav p233_310.wav p243_143.wav p246_150.wav p281_269.wav p275_417.wav p271_329.wav p284_167.wav p274_089.wav p278_267.wav p264_019.wav p304_214.wav p307_090.wav p227_363.wav p308_067.wav p292_234.wav p343_247.wav p277_374.wav p243_161.wav p295_306.wav p287_074.wav p239_272.wav p266_077.wav p312_263.wav p279_212.wav p244_350.wav p333_033.wav p255_085.wav p305_310.wav p343_339.wav p287_171.wav p278_341.wav p248_057.wav p336_300.wav p226_009.wav p246_197.wav p351_289.wav p286_347.wav p303_069.wav p295_424.wav p250_080.wav p306_086.wav p274_400.wav p273_399.wav p230_316.wav p236_278.wav p308_025.wav p277_321.wav p241_260.wav p268_406.wav p336_417.wav p347_109.wav p310_084.wav p281_329.wav p293_069.wav p265_187.wav p316_232.wav p334_047.wav p259_019.wav p339_395.wav p254_179.wav p360_095.wav p265_225.wav p293_105.wav p230_059.wav p228_124.wav p285_241.wav p363_162.wav p226_214.wav p228_235.wav p360_127.wav p259_294.wav p299_223.wav p270_121.wav p269_083.wav p272_114.wav p269_360.wav p363_017.wav p345_131.wav p305_209.wav p237_255.wav p304_387.wav p335_233.wav p258_292.wav p236_158.wav p234_104.wav p270_063.wav p307_151.wav p345_195.wav p254_025.wav p239_112.wav p260_136.wav p286_284.wav p298_346.wav p250_060.wav p255_293.wav p276_039.wav p347_365.wav p267_325.wav p237_009.wav p231_130.wav p341_210.wav p334_111.wav p298_111.wav p314_160.wav p312_050.wav p302_237.wav p374_085.wav p313_259.wav p256_241.wav p247_252.wav p285_348.wav p251_316.wav p279_350.wav p249_051.wav p234_126.wav p263_435.wav p361_198.wav p364_035.wav p316_264.wav p264_112.wav p351_016.wav p283_453.wav p268_328.wav p227_402.wav p251_323.wav p313_370.wav p271_141.wav p260_212.wav p302_207.wav p323_205.wav p341_042.wav p374_101.wav p245_342.wav p273_201.wav p292_036.wav p248_095.wav p361_162.wav
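In case it is useful, here is a minimal sketch of that kind of split, assuming the two training zips are extracted into the directories named below (the directory names and the seed are illustrative, not necessarily what was actually used):

```python
import random
from pathlib import Path

# Hypothetical extraction directories for the two VCTK training zips.
train_dirs = [Path("clean_trainset_28spk_wav"), Path("clean_trainset_56spk_wav")]

# Collect every training wav, then hold out 172 files for validation.
all_wavs = sorted(w for d in train_dirs for w in d.glob("*.wav"))
random.seed(0)  # fix the seed so the split is reproducible
valid_set = set(random.sample(all_wavs, 172))
train_set = [w for w in all_wavs if w not in valid_set]

print(f"train: {len(train_set)} files, valid: {len(valid_set)} files")
```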
Thanks very much!
I used the clean VCTK dataset and trained the model from scratch with the code you provided. However, I noticed that the synthesized speech sounds worse than what I get directly from your pretrained model. I would therefore like to ask whether you used the same data for both stage 1 and stage 2 training. For example, was it all VCTK_clean?
Hi, I used clean_trainset_28spk_wav.zip and clean_trainset_56spk_wav.zip from https://datashare.ed.ac.uk/handle/10283/2791?show=full.
Both stage 1 and stage 2 use the same dataset.
Thank you very much for your response. In the table below, the left side shows the results obtained with the pre-trained encoder and decoder checkpoints you provided, while the right side shows the results obtained after training from scratch with the provided code, configuration, and the VCTK_clean dataset. When computing these metrics (ViSQOL, PESQ, STOI), both the original speech and the output speech are downsampled to 16 kHz. I trained on a single NVIDIA GeForce RTX 3090 (24 GB). The model configurations used during inference are as follows (a sketch of the metric computation follows the list):
For symAD: tag_name="autoencoder/symAD_vctk_48000_hop300", encoder_checkpoint=200000, decoder_checkpoint=700000
For AudioDec_v0: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean", encoder_checkpoint=200000, decoder_checkpoint=500000
For AudioDec_v1: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean", encoder_checkpoint=500000, decoder_checkpoint=500000
For AudioDec_v2: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean", encoder_checkpoint=200000, decoder_checkpoint=500000
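For anyone reproducing the numbers, here is a minimal sketch of the 16 kHz metric computation, assuming the third-party pesq and pystoi packages (pip install pesq pystoi librosa); ViSQOL is a separate tool and is omitted here, and the file names are placeholders:

```python
import librosa
from pesq import pesq
from pystoi import stoi

TARGET_SR = 16000  # both signals are scored at 16 kHz

def evaluate_pair(ref_path, deg_path):
    # Downsample the original and the synthesized speech to 16 kHz.
    ref, _ = librosa.load(ref_path, sr=TARGET_SR)
    deg, _ = librosa.load(deg_path, sr=TARGET_SR)
    n = min(len(ref), len(deg))  # trim to a common length before scoring
    ref, deg = ref[:n], deg[:n]
    return {
        "pesq": pesq(TARGET_SR, ref, deg, "wb"),            # wideband PESQ
        "stoi": stoi(ref, deg, TARGET_SR, extended=False),  # intelligibility
    }

print(evaluate_pair("p226_009.wav", "p226_009_decoded.wav"))
```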
I would like to ask whether it is normal for my retrained results to be slightly lower than the pre-trained ones you provided. Additionally, would you be comfortable with me using my retrained results as the baseline for my future work?
@zhanghuiyu123 Hi! May I ask roughly how long it took you to train 700k steps for one system (e.g. AudioDec_v1)?
This is the time I spent training the different models, following the training sequence provided by the author (a rough step-to-hours helper follows the list):
Stage 0: training the autoencoder from scratch. In the table above, this corresponds to "symAD".
Stage 1: statistics extraction.
Stage 2: training the vocoder from scratch. For AudioDec_v0:
For AudioDec_v1:
For AudioDec_v2:
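As a rough sanity check, step counts can be converted into wall-clock hours like this (the 0.5 s/step below is a hypothetical placeholder, not a measured throughput):

```python
def wall_clock_hours(total_steps: int, sec_per_step: float) -> float:
    """Convert a training step count into wall-clock hours."""
    return total_steps * sec_per_step / 3600

# e.g. 700k steps at a hypothetical 0.5 s/step:
print(f"{wall_clock_hours(700_000, 0.5):.1f} h")  # -> 97.2 h
```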
@zhanghuiyu123 Thank you very much!
Hi @zhanghuiyu123, I would recommend mentioning that you "reimplemented AudioDec based on the open-source repo" in your paper to avoid any concerns from the reviewers, although I think the differences are minor.
Thank you for your suggestion! I will mention in the paper that I reimplemented AudioDec based on the open-source repository to address any concerns from the reviewers.
Excuse me, may I ask how the VCTK training and validation sets were divided?