Open serv opened 4 years ago
What do you mean by "runtimeDuration"?
Do you mean how long the TTS engine will run to generate the given phrase? Do you mean the maximum time for the TTS engine to run to create the given phrase? Do you mean the actual duration of the utterance after it was generated?
I would guess you mean "the actual duration of the utterance after it was generated". If so, you can just get this from the SpeechSynthesisUtterance events.
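For reference, a minimal sketch of what "get this from the events" could look like: record a timestamp on the utterance's start event and subtract it on the end event. The helper name measureSpokenDuration, the small UtteranceLike/SynthLike interfaces, and the injectable clock are hypothetical and exist only so the sketch also runs outside a browser; in a browser one would pass window.speechSynthesis and a SpeechSynthesisUtterance. Note that this only yields the duration after the utterance has actually been spoken.

```typescript
// Hypothetical minimal shapes (not part of the Web Speech API itself);
// they only describe the members this sketch touches.
interface UtteranceLike {
  onstart: ((ev: any) => void) | null;
  onend: ((ev: any) => void) | null;
}
interface SynthLike<U> {
  speak(utterance: U): void;
}

// Speaks the utterance and reports the wall-clock time between its
// "start" and "end" events, in milliseconds.
function measureSpokenDuration<U extends UtteranceLike>(
  synth: SynthLike<U>,
  utterance: U,
  onMeasured: (durationMs: number) => void,
  now: () => number = () => Date.now(), // injectable clock (assumption, for testing)
): void {
  let startedAt: number | null = null;
  utterance.onstart = () => {
    startedAt = now();
  };
  utterance.onend = () => {
    // Only report if a start event was actually observed.
    if (startedAt !== null) {
      onMeasured(now() - startedAt);
    }
  };
  synth.speak(utterance);
}
```

In a browser the call would be roughly measureSpokenDuration(window.speechSynthesis, new SpeechSynthesisUtterance("Hello"), ms => console.log(ms)).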
You are correct. I meant "the actual duration of the utterance after it was generated".
It seems like SpeechSynthesisUtterance events are fired only when the utterance is running "speak".
Would it be possible to get the actual duration of the utterance at the time when the utterance is instantiated, rather than while it is running "speak"?
The use case I am thinking about is similar to a YouTube video showing its runtime duration. A user can see how long a video will last. Similarly, it would be useful to see how long an utterance will run for.
I would guess that it is possible to get an estimate of the duration of an utterance before its audio is generated by, for example, introducing a duration prediction ML model. However, this would impose constraints on the implementation that would make this added feature onerous to implement.
For example, some browsers use the voices native to the OS and these voices, generally, do not have such an associated "duration prediction" ML model. So this would require that browser providers for each OS and for each voice create a "duration prediction" ML model and have these "duration prediction" ML models retrained for each new OS version and each new voice. While possible, I don't think the gained functionality justifies the effort.
I'd be interested in others' views on this too.
Agree that this sounds like a "nice to have", but not a particularly critical thing to have in the spec.
Hello there :)
Is there a way to compute a value for rate so that an utterance (text) is read in n ms? I am hacking on a subtitles reader script and need to read sentences as fast as they are spoken in the video.
The spec is vague about rate:
1 is the default rate supported by the speech synthesis engine or specific voice (which should correspond to a normal speaking rate)
How can I quantify "normal speaking rate"?
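One possible workaround, sketched under assumptions: the spec describes rate as relative (2 is twice as fast, 0.5 is half as fast) and disallows values below 0.1 or above 10, so if the duration of the text at rate 1 is known for a given voice, a target rate can be approximated as the ratio of that baseline to the desired duration. The function name rateForTargetDuration is hypothetical, the baseline has to be measured or estimated separately, and engines are not guaranteed to scale duration exactly linearly with rate.

```typescript
// Rate limits stated in the Web Speech API spec; engines or voices
// may restrict the usable range further.
const MIN_RATE = 0.1;
const MAX_RATE = 10;

// Approximates the rate needed to speak a text in targetMs, given that
// the same text and voice took baselineMs at rate 1.
// Assumption: duration scales inversely with rate.
function rateForTargetDuration(baselineMs: number, targetMs: number): number {
  if (!(baselineMs > 0) || !(targetMs > 0)) {
    throw new RangeError("durations must be positive numbers");
  }
  const idealRate = baselineMs / targetMs;
  // Clamp into the range the spec allows.
  return Math.min(MAX_RATE, Math.max(MIN_RATE, idealRate));
}
```

For a subtitle that took 4000 ms at rate 1 and must fit into 2000 ms, this gives a rate of 2. When the result hits the clamp, the text cannot be fitted into the requested time by changing rate alone.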
I'm not sure if this is possible, but it would be useful if I could get the runtime duration of an instantiated utterance.
According to https://wicg.github.io/speech-api/#utterance-attributes, the runtime attribute is not available.
Given text, voice, and rate, is it possible for SpeechSynthesisUtterance to also provide the runtimeDuration?