adelacvg / NS2VC

Unofficial implementation of NaturalSpeech2 for Voice Conversion and Text to Speech

Training steps #30

Open yiwei0730 opened 1 year ago

yiwei0730 commented 1 year ago

We first train the audio codec using 8 NVIDIA TESLA V100 16GB GPUs with a batch size of 200 audios per GPU for 440K steps. We follow the implementation and experimental setting of SoundStream [19] and adopt Adam optimizer with 2e-4 learning rate. Then we use the trained codec to extract the quantized latent vectors for each audio to train the diffusion model in NaturalSpeech 2.
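For context, here is a minimal PyTorch sketch of the two-stage recipe quoted above: train the codec first with Adam at 2e-4, then freeze it and extract quantized latents for the diffusion stage. The `ToyCodec` class, the loss, and the dummy batch are placeholder stand-ins for illustration only, not the actual NS2VC or SoundStream modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

# Toy stand-in for a SoundStream-style codec (encoder + quantizer + decoder).
# The real modules are far richer; this only illustrates the two-stage recipe.
class ToyCodec(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=4, stride=2, padding=1)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=4, stride=2, padding=1)

    def encode(self, wav):                       # wav: (B, 1, T)
        return self.encoder(wav)                 # latent: (B, dim, T/2)

    def forward(self, wav):
        z = self.encode(wav)
        return self.decoder(z), z

codec = ToyCodec()
opt = Adam(codec.parameters(), lr=2e-4)          # paper: Adam, 2e-4 LR, 440K steps

# Stage 1: train the codec on raw audio (reconstruction only here; the real
# SoundStream recipe adds adversarial and feature-matching losses).
wav = torch.randn(4, 1, 16000)                   # dummy batch; paper uses 200 audios per GPU
recon, _ = codec(wav)
loss = F.l1_loss(recon, wav)
opt.zero_grad()
loss.backward()
opt.step()

# Stage 2 preparation: freeze the trained codec and extract quantized latents,
# which then become the training targets for the diffusion model.
codec.eval()
with torch.no_grad():
    latents = codec.encode(wav)                  # saved to disk per utterance in practice
```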

The diffusion model in NaturalSpeech 2 is trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6K frames of latent vectors per GPU for 300K steps (our model is still underfitting and longer training will result in better performance). We optimize the models with the AdamW optimizer with 5e-4 learning rate, 32k warmup steps following the inverse square root learning schedule.
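The "AdamW, 5e-4 learning rate, 32k warmup steps, inverse square root schedule" part can be expressed with a `LambdaLR` wrapper. The exact schedule variant below (linear warmup to the peak LR, then decay proportional to 1/sqrt(step)) is an assumption about which inverse-square-root formulation is meant, and the model is a placeholder.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Inverse-square-root schedule with linear warmup (assumed variant).
def inv_sqrt_schedule(warmup_steps: int):
    def fn(step: int) -> float:
        step = max(step, 1)                      # avoid division by zero at step 0
        if step < warmup_steps:
            return step / warmup_steps           # linear warmup to the peak LR
        return (warmup_steps / step) ** 0.5      # then decay proportional to 1/sqrt(step)
    return fn

model = torch.nn.Linear(256, 256)                # placeholder for the diffusion model
opt = AdamW(model.parameters(), lr=5e-4)         # paper: AdamW, peak LR 5e-4
sched = LambdaLR(opt, lr_lambda=inv_sqrt_schedule(32_000))

for step in range(10):                           # skeleton of the training loop
    opt.zero_grad()
    loss = model(torch.randn(8, 256)).pow(2).mean()
    loss.backward()
    opt.step()
    sched.step()                                 # advance the LR schedule each step
```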

根據原論文的敘述,似乎他將audio codec 和 diffusion的部分分開來做訓練。 想向您請教,不知道有沒有嘗試過將兩個部分分開來做訓練的嘗試,我看到在NS2-ttsv2的訓練上似乎把codec相關的使用全部給mark起來了,是codec的效果不盡人意嗎?