Closed by manhcuong02 1 week ago
Thanks for your interest in my work. These are the parameters that I used in my paper: 'train': {'log_interval': 100, 'eval_interval': 400, 'seed': 1234, 'epochs': 1600, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}. When I trained the model, the loss dropped from about 35.xx to 14 and converged at around epoch 1500. You can try the above params. Good luck!
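To get a feel for what `learning_rate` and `lr_decay` imply over 1600 epochs, here is a rough sketch. It assumes the learning rate is multiplied by `lr_decay` once per epoch (an exponential schedule, as in the original VITS training script); the thread itself does not confirm how the scheduler is stepped, and `lr_at_epoch` is just a hypothetical helper name.

```python
# Assumption: lr(epoch) = learning_rate * lr_decay ** epoch, stepped once per epoch.
# lr_at_epoch is a hypothetical helper, not part of the training code.
def lr_at_epoch(base_lr: float, lr_decay: float, epoch: int) -> float:
    """Return the learning rate after `epoch` exponential decay steps."""
    return base_lr * lr_decay ** epoch


base_lr = 0.0002     # 'learning_rate' from the config above
lr_decay = 0.999875  # 'lr_decay' from the config above

for epoch in (0, 400, 800, 1500, 1600):
    print(f"epoch {epoch:4d}: lr = {lr_at_epoch(base_lr, lr_decay, epoch):.3e}")
```

Under that assumption the learning rate is still roughly 82% of its initial value at epoch 1600, so the decay alone is unlikely to be what stops the loss from going down.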
@thelinhbkhn2014 Hello, I've trained the XphoneBert model on the infore1 dataset and my own data. However, I'm facing a problem where the results are very poor for inputs longer than 2 sentences. As the generated speech progresses, it increasingly mispronounces words from the previous sentences. Have you encountered this situation before? And could you please explain why I'm facing this issue? I tried the original VITS model on an English dataset with the author's pretrained checkpoints and didn't encounter this issue. Thank you.
Edit: I'm encountering a limitation with the model where the input is restricted to 514 characters. I've checked the source code but couldn't locate any specific reference to this constraint. Would you be able to shed some light on this for me?
Here is a sample that I tested. The results for the second half of the paragraph are very poor:
hàng cây me_tây đầu phố đã bắt_đầu rụng lá . những chiếc lá vàng nhỏ_xíu , khô_cong như những bàn_tay bé xíu khum lại , xoay_tròn trong gió rồi nhẹ_nhàng đáp xuống mặt đường đầy bụi . chiều nay , trời se_lạnh . hương hoa sữa thoang_thoảng đâu_đó , hòa lẫn với mùi khói xe nồng_nặc , tạo nên một thứ mùi đặc_trưng của mùa thu hà_nội . tôi cuộn tròn trong chiếc áo len_dày , lặng_lẽ ngắm nhìn dòng người qua_lại .
My generated audio: result.wav
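A possible workaround for long inputs like the paragraph above is to split the text into sentence groups, synthesize each group separately, and concatenate the audio. The sketch below only covers the splitting part. It assumes the input is already word-segmented with a standalone `.` token ending each sentence, as in the sample; `split_sentences`, `chunk_sentences` and the `max_chars` value are hypothetical and not part of the repo. The model's limit may also apply to phoneme tokens and not characters, so the threshold would need tuning, and whether this fixes the mispronunciations is untested.

```python
# Hypothetical helpers: split pre-tokenized text into sentences, then group
# whole sentences into chunks that stay under a character budget.
SENTENCE_END = {".", "!", "?"}


def split_sentences(text: str) -> list[str]:
    """Split on standalone sentence-ending tokens, keeping the punctuation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole, not cut.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "chiều nay , trời se_lạnh . "
    "tôi cuộn tròn trong chiếc áo len_dày , lặng_lẽ ngắm nhìn dòng người qua_lại ."
)
for chunk in chunk_sentences(split_sentences(sample), max_chars=60):
    print(chunk)
```

Each printed chunk would then go through the usual inference call on its own, with the resulting waveforms joined afterwards.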
When you were training, to what loss value did it decrease to produce the results you reported in your paper? When I trained on a small dataset of 2 hours of Vietnamese audio for up to 1000 epochs, my `loss/g/mel` fluctuated around 20 and wouldn’t go down further. The `loss/g/kl` showed the same pattern. My training results were quite poor, and many times it only produced 1-second audio with nothing but noise. Could you please give me some advice? Thank you.