Open eschmidbauer opened 8 hours ago
We currently estimate the duration linearly (this works as long as the prompt is not too short or spoken very fast), or we just set a fixed duration. A separately trained duration predictor model could be used to provide the duration info, though I think precisely controlling the duration of each utterance is more the way Voicebox does it. But for sure, we could have an utterance-level duration predictor and just pass the summed duration to the TTS model.
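To make the linear estimate concrete, here is a rough sketch of the idea, not the project's actual code. It assumes the speaking rate is measured in characters per second from the reference prompt, and that the value handed to the model is the prompt duration plus the estimated new duration. The function names and the speed parameter are hypothetical, chosen only for illustration.

```python
def estimate_generated_seconds(ref_seconds: float, ref_text: str,
                               gen_text: str, speed: float = 1.0) -> float:
    """Linearly extrapolate how long gen_text should take to speak.

    Assumption: speaking rate is approximated as characters per second,
    measured on the reference prompt (ref_text spoken in ref_seconds).
    """
    if ref_seconds <= 0 or not ref_text:
        raise ValueError("reference audio and text must be non-empty")
    if speed <= 0:
        raise ValueError("speed must be positive")
    seconds_per_char = ref_seconds / len(ref_text)
    return seconds_per_char * len(gen_text) / speed


def estimate_total_seconds(ref_seconds: float, ref_text: str,
                           gen_text: str, speed: float = 1.0) -> float:
    """Prompt duration plus the estimated duration of the new text.

    Assumption: the model expects the summed (prompt + generated) duration.
    """
    return ref_seconds + estimate_generated_seconds(
        ref_seconds, ref_text, gen_text, speed
    )
```

This also shows why a very short or very fast prompt hurts: seconds_per_char is computed from only a few characters, so any error in it is multiplied by the length of the new text.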
Thank you for sharing this project and the models. I have been testing inference and it is very impressive. One thing I have noticed is how sensitive the fix_duration value is: if it is off by 1 s, the utterance may miss a word. Is there a way to predict the duration so that it is accurate for each utterance? Thanks again for sharing this work!