NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[TTS] Using a "reference" audio clip to control duration when synthesizing audio? #4932

Closed — throwaway30 closed this issue 1 year ago

throwaway30 commented 2 years ago

If I have an audio clip, for example the "Tears in rain" monologue from Blade Runner, is there any way to replicate the pacing/duration of that audio when generating a spectrogram for TTS?

The "Inference_DurationPitchControl" tutorial covers changing the pace of the entire clip by some fixed percentage, but what I actually want is to change the duration of the individual words/phonemes (and the gaps between them) to match their durations in the reference audio.
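The idea of per-word (rather than global) pacing can be sketched independently of any TTS framework: given word durations measured from the reference clip and the durations the synthesizer would produce by default, each word's rate multiplier is the ratio of the two. This is a minimal illustration with a hypothetical helper, not a NeMo API:

```python
def rates_from_reference(ref_durations, default_durations):
    """Per-word rate multipliers: rate = default / reference, so a word
    that is longer in the reference audio gets a rate below 1 (slower)."""
    return [d / r for d, r in zip(default_durations, ref_durations)]

# Hypothetical numbers: the synthesizer's default timing vs. durations
# measured from the reference clip, in seconds per word.
default = [0.30, 0.25, 0.40]    # e.g. "seen", "things", "you"
reference = [0.45, 0.25, 0.80]  # the same words in the reference audio

print(rates_from_reference(reference, default))  # → [0.666..., 1.0, 0.5]
```

Obtaining `reference` in practice would require a forced alignment of the reference audio against its transcript, which is a separate step not shown here.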

If this isn't currently available, is it feasible as a future feature?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

XuesongYang commented 1 year ago

Yes, this feature is supported via SSML: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-python-advanced-customization-with-ssml.html
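For concreteness, per-word rate control in SSML looks roughly like the snippet below, using the `<prosody rate="...">` element covered in the linked tutorial. The words and percentage values here are illustrative assumptions; the exact set of rate values Riva accepts is documented in that tutorial.

```python
# Build an SSML string that slows down individual words, using
# <prosody rate="..."> wrappers (rates chosen here are examples only).
words_and_rates = [("Tears", "80%"), ("in", "100%"), ("rain", "60%")]

ssml = "<speak>" + " ".join(
    f'<prosody rate="{rate}">{word}</prosody>'
    for word, rate in words_and_rates
) + "</speak>"

print(ssml)
```

The resulting string would then be passed as the input text to the TTS request in place of plain text.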

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.