daisy / pipeline-modules

Modules for the DAISY Pipeline project

Split up long sentences to avoid errors from Google Cloud TTS #103

Open bertfrees opened 1 month ago

bertfrees commented 1 month ago

Occasionally, Google Cloud TTS returns the following error:

Some sentences generate audio that is too long. Consider splitting up long sentences with sentence breaking punctuation (e.g. periods), and/or removing SSML <break> tags.

Because sentence detection currently splits only on ".", sentences can indeed become very long, e.g. when they contain many commas, colons and/or semicolons.

bertfrees commented 1 month ago

Modifying EuroSentenceDetector so that it also splits on e.g. ";" in addition to "." would be one possible solution, but it might then no longer be semantically correct to call the results "sentences".
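To illustrate the first option, here is a minimal sketch that keeps "." as the primary boundary and only falls back to ";" for sentences above a length threshold, so that short sentences stay intact. It works on plain text and does not reflect how EuroSentenceDetector is actually implemented; `LongSentenceSplitter`, `split` and `maxLength` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pipeline-modules.
final class LongSentenceSplitter {

    /**
     * Split on "." first; split further on ";" only for
     * sentences longer than maxLength characters.
     */
    static List<String> split(String text, int maxLength) {
        List<String> result = new ArrayList<>();
        // Lookbehind keeps the punctuation attached to the preceding chunk.
        for (String sentence : text.split("(?<=\\.)\\s+")) {
            if (sentence.length() <= maxLength) {
                result.add(sentence);
            } else {
                for (String part : sentence.split("(?<=;)\\s+")) {
                    result.add(part);
                }
            }
        }
        return result;
    }
}
```

A character count is only a rough proxy for audio duration, so a suitable threshold would have to be found by experiment.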

Another approach could be to make it the responsibility of GoogleRestTTSEngine to split the SSML into smaller parts if this error is encountered, similar to how DefaultSSMLMarkSplitter does it.
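A rough sketch of this second approach, assuming the engine can recognize the "too long" error and retry. For brevity it operates on plain text rather than SSML, and everything in it is hypothetical: `Synthesizer` stands in for a single TTS request, `AudioTooLongException` for the error quoted above, and neither mirrors the real GoogleRestTTSEngine or DefaultSSMLMarkSplitter code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of pipeline-modules.
final class SplitOnTooLong {

    /** Stand-in for the "audio that is too long" error. */
    static class AudioTooLongException extends RuntimeException {}

    /** Stand-in for a single TTS request. */
    interface Synthesizer {
        byte[] synthesize(String text);
    }

    /** Synthesize text; on "too long", cut it in two and retry each half. */
    static List<byte[]> speak(String text, Synthesizer tts) {
        List<byte[]> audio = new ArrayList<>();
        try {
            audio.add(tts.synthesize(text));
        } catch (AudioTooLongException e) {
            int cut = findSplitPoint(text);
            if (cut < 0) throw e; // no punctuation left to split on
            // Both halves are strictly shorter, so the recursion terminates.
            audio.addAll(speak(text.substring(0, cut).trim(), tts));
            audio.addAll(speak(text.substring(cut).trim(), tts));
        }
        return audio;
    }

    /** Index just after the ';', ':' or ',' closest to the middle, or -1 if none. */
    static int findSplitPoint(String text) {
        int mid = text.length() / 2;
        int best = -1;
        // Skip the last character so the second half is never empty.
        for (int i = 0; i < text.length() - 1; i++) {
            char c = text.charAt(i);
            if ((c == ';' || c == ':' || c == ',')
                    && (best < 0 || Math.abs(i - mid) < Math.abs(best - 1 - mid))) {
                best = i + 1;
            }
        }
        return best;
    }
}
```

Splitting only after a failure leaves the sentence detector unchanged and costs one extra request per oversized sentence. A real implementation would also have to keep the SSML well-formed across the cut.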