I randomly selected 172 files from the 28- and 56-speaker training sets to form the validation set (a sketch of this split follows the file list below).
Here are the selected files for the validation set. p282_239.wav p314_058.wav p339_266.wav p323_362.wav p275_161.wav p299_340.wav p326_324.wav p283_156.wav p301_222.wav p233_017.wav p267_060.wav p301_041.wav p258_383.wav p241_157.wav p276_431.wav p244_006.wav p306_237.wav p326_227.wav p231_367.wav p256_320.wav p247_229.wav p284_004.wav p266_319.wav p272_249.wav p249_235.wav p245_334.wav p364_223.wav p335_243.wav p333_380.wav p263_346.wav p282_330.wav p376_173.wav p303_071.wav p376_052.wav p310_279.wav p233_310.wav p243_143.wav p246_150.wav p281_269.wav p275_417.wav p271_329.wav p284_167.wav p274_089.wav p278_267.wav p264_019.wav p304_214.wav p307_090.wav p227_363.wav p308_067.wav p292_234.wav p343_247.wav p277_374.wav p243_161.wav p295_306.wav p287_074.wav p239_272.wav p266_077.wav p312_263.wav p279_212.wav p244_350.wav p333_033.wav p255_085.wav p305_310.wav p343_339.wav p287_171.wav p278_341.wav p248_057.wav p336_300.wav p226_009.wav p246_197.wav p351_289.wav p286_347.wav p303_069.wav p295_424.wav p250_080.wav p306_086.wav p274_400.wav p273_399.wav p230_316.wav p236_278.wav p308_025.wav p277_321.wav p241_260.wav p268_406.wav p336_417.wav p347_109.wav p310_084.wav p281_329.wav p293_069.wav p265_187.wav p316_232.wav p334_047.wav p259_019.wav p339_395.wav p254_179.wav p360_095.wav p265_225.wav p293_105.wav p230_059.wav p228_124.wav p285_241.wav p363_162.wav p226_214.wav p228_235.wav p360_127.wav p259_294.wav p299_223.wav p270_121.wav p269_083.wav p272_114.wav p269_360.wav p363_017.wav p345_131.wav p305_209.wav p237_255.wav p304_387.wav p335_233.wav p258_292.wav p236_158.wav p234_104.wav p270_063.wav p307_151.wav p345_195.wav p254_025.wav p239_112.wav p260_136.wav p286_284.wav p298_346.wav p250_060.wav p255_293.wav p276_039.wav p347_365.wav p267_325.wav p237_009.wav p231_130.wav p341_210.wav p334_111.wav p298_111.wav p314_160.wav p312_050.wav p302_237.wav p374_085.wav p313_259.wav p256_241.wav p247_252.wav p285_348.wav p251_316.wav p279_350.wav p249_051.wav p234_126.wav p263_435.wav p361_198.wav p364_035.wav p316_264.wav p264_112.wav p351_016.wav p283_453.wav p268_328.wav p227_402.wav p251_323.wav p313_370.wav p271_141.wav p260_212.wav p302_207.wav p323_205.wav p341_042.wav p374_101.wav p245_342.wav p273_201.wav p292_036.wav p248_095.wav p361_162.wav
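In case it is useful, here is a minimal sketch of that kind of split, assuming the two training zips are extracted into the directories named below (the directory names and the seed are illustrative, not necessarily what was actually used):

```python
import random
from pathlib import Path

# Hypothetical extraction directories for the two VCTK training zips.
train_dirs = [Path("clean_trainset_28spk_wav"), Path("clean_trainset_56spk_wav")]

# Collect every training wav, then hold out 172 files for validation.
all_wavs = sorted(w for d in train_dirs for w in d.glob("*.wav"))
random.seed(0)  # fix the seed so the split is reproducible
valid_set = set(random.sample(all_wavs, 172))
train_set = [w for w in all_wavs if w not in valid_set]

print(f"train: {len(train_set)} files, valid: {len(valid_set)} files")
```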
Thanks very much!
I used the clean VCTK dataset and trained the model from scratch with the code you provided. However, I noticed that the synthesized speech sounds worse than what I get directly from your pretrained model. I would therefore like to ask whether you used the same data for both stage 1 and stage 2 training. For example, was it all VCTK_clean?
Hi, I used clean_trainset_28spk_wav.zip and clean_trainset_56spk_wav.zip from https://datashare.ed.ac.uk/handle/10283/2791?show=full.
Both stage 1 and stage 2 use the same dataset.
Thank you very much for your response. In the table below, the left side shows the results obtained with the pre-trained encoder and decoder checkpoints you provided, while the right side shows the results obtained after training from scratch with the provided code, configuration, and the VCTK_clean dataset. When computing these metrics (ViSQOL, PESQ, STOI), both the original speech and the output speech are downsampled to 16 kHz. I trained on a single NVIDIA GeForce RTX 3090 (24 GB). The model configurations used during inference are as follows (a sketch of the metric computation follows the list):
For symAD: tag_name="autoencoder/symAD_vctk_48000_hop300", encoder_checkpoint=200000, decoder_checkpoint=700000
For AudioDec_v0: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean", encoder_checkpoint=200000, decoder_checkpoint=500000
For AudioDec_v1: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean", encoder_checkpoint=500000, decoder_checkpoint=500000
For AudioDec_v2: autoencoder="autoencoder/symAD_vctk_48000_hop300", tag_name="vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean", encoder_checkpoint=200000, decoder_checkpoint=500000
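For anyone reproducing the numbers, here is a minimal sketch of the 16 kHz metric computation, assuming the third-party pesq and pystoi packages (pip install pesq pystoi librosa); ViSQOL is a separate tool and is omitted here, and the file names are placeholders:

```python
import librosa
from pesq import pesq
from pystoi import stoi

TARGET_SR = 16000  # both signals are scored at 16 kHz

def evaluate_pair(ref_path, deg_path):
    # Downsample the original and the synthesized speech to 16 kHz.
    ref, _ = librosa.load(ref_path, sr=TARGET_SR)
    deg, _ = librosa.load(deg_path, sr=TARGET_SR)
    n = min(len(ref), len(deg))  # trim to a common length before scoring
    ref, deg = ref[:n], deg[:n]
    return {
        "pesq": pesq(TARGET_SR, ref, deg, "wb"),            # wideband PESQ
        "stoi": stoi(ref, deg, TARGET_SR, extended=False),  # intelligibility
    }

print(evaluate_pair("p226_009.wav", "p226_009_decoded.wav"))
```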
I would like to ask whether it is normal for my retrained results to be slightly lower than the pre-trained ones you provided. Additionally, would you be comfortable with me using my retrained results as the baseline for my future work?
@zhanghuiyu123 Hi! May I ask roughly how long it took you to train 700k steps for one system (e.g. AudioDec_v1)?
This is the time I spent training the different models, following the training sequence provided by the author (a rough step-to-hours helper follows the list):
Stage 0: training the autoencoder from scratch. In the table above, this corresponds to "symAD".
Stage 1: statistics extraction.
Stage 2: training the vocoder from scratch. For AudioDec_v0:
For AudioDec_v1:
For AudioDec_v2:
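As a rough sanity check, step counts can be converted into wall-clock hours like this (the 0.5 s/step below is a hypothetical placeholder, not a measured throughput):

```python
def wall_clock_hours(total_steps: int, sec_per_step: float) -> float:
    """Convert a training step count into wall-clock hours."""
    return total_steps * sec_per_step / 3600

# e.g. 700k steps at a hypothetical 0.5 s/step:
print(f"{wall_clock_hours(700_000, 0.5):.1f} h")  # -> 97.2 h
```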
@zhanghuiyu123 Thank you very much!
Hi @zhanghuiyu123, I would recommend mentioning that you "reimplemented AudioDec based on the open-source repo" in your paper to avoid any concerns from the reviewers, although I think the differences are minor.
Thank you for your suggestion! I will mention in the paper that I reimplemented AudioDec based on the open-source repository to address any concerns from the reviewers.
Excuse me, may I ask how the VCTK training and validation sets were divided?