festvox / flite

A small fast portable speech synthesis system

Is there a way to increase/add the pause between words? #77

Open Atreyagaurav opened 2 years ago

Atreyagaurav commented 2 years ago

Although flite is very fast, the words sound run together when it speaks; in the generated audio it is hard to hear where the whitespace between words should be.

For example, when the line below is spoken, the words run together (I have found this for almost all words in a sentence). I can recognize the words when I'm looking at the text, but it's hard when the text is not there.

flite -ps -t "hello my name is John Doe!"

Outputs:

pau hh ax l ow m ay n ey m ih z jh aa n d ow pau

And when spoken (without the -ps flag), it sounds exactly like that. The pauses fall only between sentences, not between words.

I looked through the documentation and found nothing. I also looked through the code to see whether I could increase the pause duration, but I couldn't find anything at all.

I find it hard to imagine I'm the only one who has noticed this, but I couldn't find anything on it, so I'm opening this issue.

awbcmu commented 2 years ago

Flite does care about word boundaries in its predictions (even though it doesn't show that in the list of segments).

Flite will predict phrase boundaries based on punctuation (sometimes) and length of phrases too.

You can explicitly put in breaks by using ssml, speech synthesis markup language. See flite/tools/example.ssml, you need to add -ssml to go into ssml mode.

Note that if you give the text on the command line, it will always be treated as a single sentence. If you put it in a file, it will do sentence segmentation too (full stops/periods and blank lines will cause sentence boundaries).

Also, you can control the overall speed by setting the global duration stretch (less than 1.0 will make it faster, bigger than 1.0 will make it slower), e.g.

./bin/flite --setf duration_stretch=0.9 doc/alice
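For scripting, the same invocation can be assembled programmatically. The following is a minimal sketch, not part of flite: the helper name `flite_command` and its parameters are hypothetical, and it only reproduces the `--setf duration_stretch=...` form shown above.

```python
def flite_command(text_file, duration_stretch=1.0, flite_bin="flite"):
    """Build an argv list mirroring `flite --setf duration_stretch=X FILE`.

    Hypothetical helper: only the --setf duration_stretch option and the
    positional text file come from the comment above.
    """
    if duration_stretch <= 0:
        raise ValueError("duration_stretch must be positive")
    # Values below 1.0 speed speech up; values above 1.0 slow it down.
    return [flite_bin, "--setf", f"duration_stretch={duration_stretch}", text_file]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works
    # even on a machine where flite is not installed.
    print(" ".join(flite_command("doc/alice", 0.9, "./bin/flite")))
    # With flite installed, the list can be passed to subprocess.run(cmd, check=True).
```

Returning a list rather than a single string avoids shell quoting problems when the file path contains spaces.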

You can also control this factor in ssml

Alan

On Wed, May 18, 2022 at 10:53 AM Gaurav Atreya @.***> wrote:

Atreyagaurav commented 2 years ago

Thank you for your response.

> Flite does care about word boundaries in its predictions (even though it doesn't show that in the list of segments).

Does it have any variables that control how that break is generated? Ideally there would be a scale parameter like the one for the overall voice, but applied only to the breaks. A single scale parameter shared by word, phrase, and sentence boundaries would be fine for me. If there isn't one, maybe we could add it, provided there is code that generates that silence.

--setf duration_stretch doesn't help much for me: it also slows down the words themselves, which sounds weird when all I want is a longer break between words.

I did find something called an utterance break in the source code, so I assumed that sentences, not words, are turned into utterances. Even then, I couldn't find how to modify its duration. I also tried to look at how utterances are used, but to no avail. It looks like the wave struct has a samples array, which may hold the audio between the breaks.

I also looked at the file testsuite/by_word_main.c to see if it does what I want. It looks promising, but it seems to be mostly focused on printing. It seems to be by far the best option so far, though I wasn't able to modify it to increase the break.

> You can explicitly put in breaks by using ssml, speech synthesis markup language. See flite/tools/example.ssml, you need to add -ssml to go into ssml mode.

Thank you for this. I gave it a look, and it does help with adding breaks. However, it isn't perfect. For example:

This <break /> is <break /> a  <break /> pen.

This doesn't sound natural; the pronunciation of "a" is truncated. The line below, however, sounds fine.

This <break /> is <break /> a  pen.

So unless I parse each sentence to find the natural break points myself, it would be hard to use.

Which brings me back to why I was hoping to simply stretch the silence between the words or phrases (or whatever unit Festival uses), so I don't have to work so hard.
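To illustrate what such parsing could look like, here is a rough sketch that inserts `<break />` after every word except short function words such as "a", which is where the truncation was heard above. Everything in it is an assumption: the function name `add_breaks` and the word list are hypothetical and not tuned against flite, and the output may still need whatever wrapper flite/tools/example.ssml uses before it is passed to flite with -ssml.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical stop list: words after which no break is inserted.
# This is a guess based on the "a <break /> pen" example, not a flite rule.
NO_BREAK_AFTER = {"a", "an", "the", "of", "to", "in", "on", "at",
                  "my", "your", "his", "her", "its", "our", "their"}


def add_breaks(sentence, no_break_after=NO_BREAK_AFTER):
    """Return the sentence with <break /> tags between words.

    No break is added after the last word or after a word in
    no_break_after. XML special characters in words are escaped.
    """
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        out.append(escape(word))
        is_last = i == len(words) - 1
        # Strip punctuation before checking the stop list.
        bare = re.sub(r"\W+", "", word).lower()
        if not is_last and bare not in no_break_after:
            out.append("<break />")
    return " ".join(out)


if __name__ == "__main__":
    print(add_breaks("This is a pen."))
    # prints: This <break /> is <break /> a pen.
```

For the sentence in question this reproduces the variant that sounded fine; whether the same word list works for other sentences would have to be checked by ear.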

Sorry for the trouble. Hopefully it isn't much.