Implementation of Disfluencies in TextToSpeech

NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html

Apache License 2.0


Implementation of Disfluencies in TextToSpeech #8109

Closed rodrigoGA closed 8 months ago

rodrigoGA commented 10 months ago

I am working on a project that involves creating dialogues that sound spontaneous and natural. A key feature I would like to implement is the use of disfluencies (such as “Mm-hmm”) during moments when calculations are being made or there are pauses in the dialogue.

So far, I have experimented with using phonemes to create these disfluencies, but the results have not been satisfactory. It's possible that I haven't found the correct IPA pronunciation for these expressions. (If you have any IPA pronunciation suggestions to try, they would be greatly appreciated).

My questions are as follows:

1. Is the integration of disfluencies in dialogue currently supported?
2. If so, could you recommend a strategy or approach for implementing these disfluencies more naturally?

BilalNaazir commented 9 months ago

Any update on this?

XuesongYang commented 9 months ago

Hi @rodrigoGA, thanks for your interest in the NeMo TTS toolkit. If you stick with our current phoneme-based TTS models, such as FastPitch, you have to add new IPA dictionary entries for those filler words, and it is better to have paired filler speech as well.
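A minimal sketch of what adding such dictionary entries could look like, for illustration only. It assumes a plain-text pronunciation dictionary with one `WORD<separator>IPA` entry per line; the exact file format and separator used by a given NeMo dictionary should be checked before use. The IPA strings are candidate transcriptions to experiment with, not values validated against any model, and `FILLER_IPA` and `merge_filler_entries` are hypothetical names. The syllabic mark in `m̩` may also be missing from a model's symbol set, in which case a plain `m` is a possible fallback.

```python
# Hypothetical filler-word entries; candidate transcriptions only.
FILLER_IPA = {
    "mm-hmm": "m̩ˈhm̩",
    "uh-huh": "ʌˈhʌ",
    "hmm": "hm̩",
    "um": "ʌm",
    "uh": "ʌ",
}


def merge_filler_entries(dict_lines, fillers, sep="  "):
    """Return dict_lines plus one 'WORD<sep>IPA' line per filler not already present.

    The 'WORD<sep>IPA' layout is an assumption about the dictionary format.
    """
    # Collect the headwords that the existing dictionary already defines.
    known = {
        line.split(maxsplit=1)[0].upper()
        for line in dict_lines
        if line.strip()
    }
    merged = list(dict_lines)
    for word, ipa in fillers.items():
        key = word.upper()
        if key not in known:
            merged.append(f"{key}{sep}{ipa}")
            known.add(key)
    return merged


# Toy existing dictionary; "UM" is already defined and is left untouched.
existing = ["HELLO  həˈloʊ", "UM  ʌm"]
for line in merge_filler_entries(existing, FILLER_IPA):
    print(line)
```

Entries like these only tell the model how a filler is spelled in phonemes. Without recordings of the fillers in the training data, the model has no example of how they should sound, which is why paired filler speech matters.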

As far as I know, we haven't added support for disfluency synthesis yet. It can only work if the training corpus contains filler speech/text pairs and the corresponding phonemes.

If you have filler speech/text pairs but can't figure out canonical phonemes for the filler words, you may try using a grapheme-based tokenizer for FastPitch.
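To illustrate why a grapheme-based tokenizer sidesteps the pronunciation problem, here is a toy character-level tokenizer. It is not NeMo's tokenizer API; `build_vocab` and `encode` are hypothetical helpers, and the transcripts are made up. The idea is that a filler such as "mm-hmm" is encoded directly from its spelling, so no IPA entry is needed, and the model has to learn the sound from the paired audio.

```python
def build_vocab(texts):
    """Map every character seen in the transcripts to an integer id (0 = unknown)."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(text, vocab):
    """Encode text character by character; unseen characters map to 0."""
    return [vocab.get(ch, 0) for ch in text.lower()]


# Made-up transcripts that include fillers as ordinary text.
transcripts = ["mm-hmm, let me check.", "uh, one moment."]
vocab = build_vocab(transcripts)

ids = encode("mm-hmm", vocab)
print(ids)  # one id per character, with no pronunciation dictionary involved
```

The trade-off is that the spelling of a filler carries little information about its sound, so this approach depends even more heavily on having enough filler recordings in the training data.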

rodrigoGA commented 9 months ago

Hi, thanks for the detailed response. It's really interesting and useful to consider support for disfluency synthesis. I hope this feature will be considered for future training versions.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.