TTS Audio: Long post with 10+ mins of TTS audio does not get processed and handled.

joshuaabenazer commented 1 year ago

Describe the bug

The TTS Speech service seems to limit audio files to a maximum length of 10 minutes, regardless of whether the account is free or paid - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits#real-time-text-to-speech
Another reason could be that the access token itself is only valid for 10 mins - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-to-text-short#how-to-use-an-access-token
Any request that would exceed the 10 minute mark does not make it through; it is aborted the moment the TTS API detects that.
Based on some tests I ran, close to 1300 words accounts for up to 10 minutes of audio. This is only a ballpark word count for where the 10 minute cutoff falls, and it likely depends on punctuation and the length of the words used.
Via the interface I was able to confirm that the editorial experience works for any post that comes under the 10 minute mark.

Possible solutions:

Use the estimated read time to determine whether audio can be generated for a post; when it cannot, show a hint and disable audio creation.
Review how we can leverage Batch synthesis to achieve this - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-synthesis (although this may be a paid feature)
The Long Audio API could also be reviewed for this, but it will be deprecated once batch synthesis is released.
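
A rough sketch of the read-time check from the first solution, assuming the ballpark from the tests above (about 1300 words for about 10 minutes of audio). The function names and the safety margin are hypothetical, not part of the plugin, and the ratio would need tuning against real content.

```python
import re

# Assumption: ~1300 words produce ~10 minutes of audio (ballpark from the issue).
WORDS_PER_MINUTE = 1300 / 10
# The real-time TTS limit discussed in this issue.
MAX_REALTIME_MINUTES = 10.0
# Hypothetical margin, since the estimate varies with punctuation and word length.
SAFETY_MARGIN = 0.9


def count_words(text: str) -> int:
    """Count whitespace-separated tokens in the post content."""
    return len(re.findall(r"\S+", text))


def estimate_audio_minutes(text: str) -> float:
    """Estimate the audio duration in minutes from the word count."""
    return count_words(text) / WORDS_PER_MINUTE


def can_generate_realtime_audio(text: str) -> bool:
    """Return True when the estimate stays safely under the 10 minute limit."""
    return estimate_audio_minutes(text) <= MAX_REALTIME_MINUTES * SAFETY_MARGIN
```

The editor could call something like `can_generate_realtime_audio()` before enabling the audio control and show a hint when it returns False.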

Steps to Reproduce

Test audio generation on a post with lengthy content, at least 1300-1500 words. The audio generation process starts but then ends without any notification or error, and no audio file is generated.

Screenshots, screen recording, code snippet

No response

Environment information

No response

WordPress information

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Sidsector9 commented 1 year ago

I did some digging into Batch synthesis and I have 1 concern with how it works.

Creating a batch synthesis is straightforward: call the /batchsynthesis endpoint with the input data, and that starts it. However, this endpoint does not return the audio data. It returns a JSON object with some information about the batch; one of its properties is the batch ID.

To get the audio data, we are required to call /batchsynthesis/batchId periodically to check if the audio has been created for that batch. This endpoint returns the URL for the ZIP that will contain our audio file.
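
The periodic check could look roughly like the sketch below. It does not call the real API: `fetch_status` is a hypothetical callable that would wrap the request to /batchsynthesis/batchId, and the `status` values and `outputs.result` field are assumptions about the response shape that should be verified against the Batch synthesis documentation.

```python
import time
from typing import Callable, Optional


def poll_batch_synthesis(
    fetch_status: Callable[[str], dict],
    batch_id: str,
    interval_seconds: float = 30.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the batch finishes; return the ZIP URL, or None on timeout."""
    for attempt in range(max_attempts):
        batch = fetch_status(batch_id)
        status = batch.get("status")  # assumed field name
        if status == "Succeeded":  # assumed status value
            # Assumed location of the URL for the ZIP with the audio file.
            return batch.get("outputs", {}).get("result")
        if status == "Failed":  # assumed status value
            raise RuntimeError(f"Batch synthesis {batch_id} failed")
        # Still running: wait before the next check, except after the last one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return None
```

Passing `fetch_status` and `sleep` in as parameters keeps the loop testable without network access or real delays.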

A few approaches that we can take:

The user keeps the editor open for the periodic checks to download the audio file and attach it to the post.
Or, we can allow the user to close the editor by adding a "status" post meta that records the last known status of the batch. Once the editor is reopened for that post, the periodic check resumes and, once the batch has completed, we can download the audio file. (The delay here will be downloading and attaching the audio file to the post.)
Or, we can implement server-side batch processing, similar to Action Scheduler, which can handle downloading the audio and attaching it to the post.
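
For the second and third approaches, the stored "status" meta could drive a single resumable check that runs either when the editor reopens or from a scheduled server-side job. This is only a sketch: the meta keys, the local status names, and the `fetch_status` and `attach_audio` callables are all hypothetical, and the remote `status` and `outputs.result` fields are assumptions about the API response.

```python
from typing import Callable


def advance_audio_job(
    meta: dict,
    fetch_status: Callable[[str], dict],
    attach_audio: Callable[[str], None],
) -> dict:
    """Run one status check and return the updated post meta."""
    # Terminal states: nothing left to do, so skip the remote call entirely.
    if meta.get("status") in ("completed", "failed"):
        return meta

    batch = fetch_status(meta["batch_id"])
    remote_status = batch.get("status")  # assumed field name

    if remote_status == "Succeeded":  # assumed status value
        # Download and attach the audio; the URL location is assumed.
        attach_audio(batch["outputs"]["result"])
        return {**meta, "status": "completed"}
    if remote_status == "Failed":  # assumed status value
        return {**meta, "status": "failed"}
    # Not finished yet: keep the job pending so the next run checks again.
    return {**meta, "status": "pending"}
```

Because each call performs one check and persists the result in the meta, the job can stop and resume at any point without the editor staying open.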

From @joshuaabenazer's points, we can do (1) and, depending on the estimated duration of the audio file, conditionally use either the existing TTS REST API or the Batch Synthesis API.

@10up/open-source-practice any thoughts on this?

dkotter commented 1 year ago

I don't think we should build something that requires a user to stay on a particular page until the process completes. Option 3 sounds best to me: the audio process is kicked off once the content is published and the processing happens behind the scenes, so it doesn't matter whether someone leaves the post or not. It would be great if the UI still updates automatically if someone happens to stay on the post screen the entire time.

I believe we do something similar right now with our PDF read functionality, so may be worth looking at that. I also know I've seen other APIs that require polling an endpoint to see if the process is complete, so having a solution that can be reused would be great.

faisal-alvi commented 1 year ago

I agree with the 3rd point, as users do not have to wait in that approach (as @dkotter mentioned). I would also like to suggest one addition: we should notify users via email once their file is ready and attached to the post. This eliminates the need for users to keep returning to the post to check whether the file is ready.

10up / classifai