WICG / speech-api

Web Speech API
https://wicg.github.io/speech-api/

Support runtimeDuration attributes in SpeechSynthesisUtterance #71

Open serv opened 4 years ago

serv commented 4 years ago

I'm not sure if this is possible, but it would be useful if I could get the runtime duration of an instantiated utterance.

According to https://wicg.github.io/speech-api/#utterance-attributes, the runtime attribute is not available.

Given text, voice, and rate, is it possible for SpeechSynthesisUtterance to also provide the runtimeDuration?

kdavis-mozilla commented 4 years ago

What do you mean by "runtimeDuration"?

Do you mean how long the TTS engine will run to generate the given phrase? Do you mean the maximum time for the TTS engine to run to create the given phrase? Do you mean the actual duration of the utterance after it was generated?

I would guess you mean "the actual duration of the utterance after it was generated". If so, you can just get this from the SpeechSynthesisUtterance events.
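For instance, a minimal sketch of measuring the actual spoken duration from the `start` and `end` events (the `SpeechSynthesisEvent` also exposes `elapsedTime` in seconds, which could be used instead of `performance.now()`):

```javascript
// Sketch: measure how long an utterance actually took to speak, using the
// standard `start` and `end` events on SpeechSynthesisUtterance.
function measureUtteranceDuration(text, onDuration) {
  const utterance = new SpeechSynthesisUtterance(text);
  let startedAt = 0;
  utterance.addEventListener("start", () => {
    startedAt = performance.now();
  });
  utterance.addEventListener("end", () => {
    // Report the measured duration in milliseconds.
    onDuration(performance.now() - startedAt);
  });
  speechSynthesis.speak(utterance);
}

// Usage (guarded so the sketch is inert outside a browser context):
if (typeof speechSynthesis !== "undefined") {
  measureUtteranceDuration("Hello world", (ms) => {
    console.log(`Utterance took ${ms} ms`);
  });
}
```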

serv commented 4 years ago

You are correct. I meant "the actual duration of the utterance after it was generated".

It seems like SpeechSynthesisUtterance events are fired only while the utterance is being spoken via "speak".

Would it be possible to get the actual duration of the utterance at the time it is instantiated, rather than while "speak" is running?

The use case I am thinking about is similar to a YouTube video showing its runtime duration. A user can see how long a video will last. Similarly, it would be useful to see how long an utterance will run for.

kdavis-mozilla commented 4 years ago

I would guess that it is possible to get an estimate of the duration of an utterance before its audio is generated by, for example, introducing a duration prediction ML model. However, this would impose constraints on the implementation that would make this added feature onerous to implement.

For example, some browsers use the voices native to the OS and these voices, generally, do not have such an associated "duration prediction" ML model. So this would require that browser providers for each OS and for each voice create a "duration prediction" ML model and have these "duration prediction" ML models retrained for each new OS version and each new voice. While possible, I don't think the gained functionality justifies the effort.

I'd be interested in others' views on this too.
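Short of an engine-level prediction model, a much cruder estimate can be computed client-side from word count. This sketch assumes an average speaking rate of roughly 150 words per minute at `rate = 1`; that figure is a heuristic, not anything specified by the API, and real engines vary by voice and language:

```javascript
// Heuristic estimate of utterance duration in milliseconds.
// WORDS_PER_MINUTE is an assumed average for rate = 1; not part of the spec.
const WORDS_PER_MINUTE = 150;

function estimateDurationMs(text, rate = 1) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words / (WORDS_PER_MINUTE * rate)) * 60000;
}

// Example: a three-word phrase at the default rate.
console.log(estimateDurationMs("one two three")); // → 1200
```

Accuracy will be poor for text with long words, numbers, or abbreviations, which is exactly why an engine-side attribute would be more reliable.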

marcoscaceres commented 4 years ago

Agree that this sounds like a "nice to have", but not particularly critical thing to have in the spec.

giuseppeg commented 3 years ago

Hello there :)

Is there a way to compute a value for rate so that an utterance (text) is read in n ms? I am hacking on a subtitles reader script and need to read sentences as fast as they are spoken in the video.

The spec is vague about rate:

1 is the default rate supported by the speech synthesis engine or specific voice (which should correspond to a normal speaking rate)

How can I quantify "normal speaking rate"?
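There doesn't seem to be a spec-defined answer, but one workaround is to assume a nominal speaking rate (say, ~150 words per minute at `rate = 1`; this number is a guess, and engines differ per voice) and solve for `rate` from a target duration. A sketch:

```javascript
// Sketch: pick a `rate` so the text is read in roughly targetMs milliseconds.
// WPM_AT_RATE_1 is an assumed nominal speaking rate, not a spec value.
const WPM_AT_RATE_1 = 150;

function rateForDuration(text, targetMs) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  const estimatedMsAtRate1 = (words / WPM_AT_RATE_1) * 60000;
  // rate > 1 speaks faster; clamp to the spec's allowed range of 0.1–10.
  const rate = estimatedMsAtRate1 / targetMs;
  return Math.min(10, Math.max(0.1, rate));
}

// Example: five words take ~2000 ms at rate 1, so a 1000 ms target needs rate 2.
console.log(rateForDuration("a b c d e", 1000)); // → 2
```

Calibrating `WPM_AT_RATE_1` per voice (by speaking a known sentence once and measuring with the `start`/`end` events) would make this less of a shot in the dark.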