X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"
https://cantabile-kwok.github.io/VoiceFlow/

about training #16

Closed chasing-ant closed 3 weeks ago

chasing-ant commented 1 month ago

Hi, thanks for your great work. I'm having trouble running the following command in a terminal:

python train.py -c configs/lj_16k_gt_dur.yaml -m lj_16k_gt_dur

It fails with the following error:

File "*/VoiceFlow-TTS-main/data_loader.py", line 22, in check_frame_length
    assert sum(dur) == mel.shape[1], f"Frame length mismatch: utt {utt}, dur: {sum(dur)}, mel: {mel.shape[1]}"
AssertionError: Frame length mismatch: utt LJ043-0008, dur: 554, mel: 553

I changed the assertion condition to abs(sum(dur) - mel.shape[1]) <= 1 and training now runs, but I don't know whether this affects the results. The following warnings also appear at run time:

numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,

and

numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

cantabile-kwok commented 1 month ago

A similar issue is: https://github.com/X-LANCE/VoiceFlow-TTS/issues/11#issuecomment-2084334819

@chasing-ant This length mismatch is a common phenomenon, and you can overcome this by truncating or padding the features to the same length. In your case, as mel is one frame shorter than durations, the recommended solution is to zero-pad the mel sequence by 1 frame. I am not sure whether the numpy RuntimeWarning will affect the result (intuitively it won't), but at least padding or truncating before training can avoid such warnings.
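
The padding-or-truncation fix described above could look roughly like the following sketch. It is not code from the VoiceFlow repository: the function name match_mel_to_durations and the parameters pad_value and max_diff are hypothetical, and it assumes mel is a NumPy array of shape (n_mels, n_frames), as implied by the mel.shape[1] check in the assertion. It would be applied to each utterance before training, so that the original strict assertion can stay in place.

```python
import numpy as np


def match_mel_to_durations(mel, dur, pad_value=0.0, max_diff=1):
    """Pad or truncate `mel` along the frame axis so that it has sum(dur) frames.

    Hypothetical helper, not part of VoiceFlow. Assumes `mel` has shape
    (n_mels, n_frames) and `dur` is a sequence of per-token frame counts.
    """
    target = int(np.sum(dur))
    diff = target - mel.shape[1]

    # Refuse to hide large mismatches, which usually indicate a real data problem.
    if abs(diff) > max_diff:
        raise ValueError(
            f"Frame length mismatch too large: dur={target}, mel={mel.shape[1]}"
        )

    if diff > 0:
        # mel is shorter than the durations: pad extra frames at the end.
        mel = np.pad(mel, ((0, 0), (0, diff)), mode="constant",
                     constant_values=pad_value)
    elif diff < 0:
        # mel is longer than the durations: drop the trailing frames.
        mel = mel[:, :target]
    return mel


# The case from this issue: durations sum to 554, mel has 553 frames.
mel = np.ones((80, 553), dtype=np.float32)
dur = [300, 254]
fixed = match_mel_to_durations(mel, dur)
assert sum(dur) == fixed.shape[1]
```

The default pad_value of 0.0 follows the zero-padding suggestion above. Whether zero is an appropriate fill value depends on how the mel features were normalized, which this sketch does not know.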

chasing-ant commented 1 month ago

I'll give it a try. Thank you for your detailed explanation.