edrlab / thorium-reader

A cross platform desktop reading app, based on the Readium Desktop toolkit
https://www.edrlab.org/software/thorium-reader/
BSD 3-Clause "New" or "Revised" License

TTS sentence splitter, migrate to native web API Intl.Segmenter #2185

Open danielweck opened 1 month ago

danielweck commented 1 month ago

https://www.npmjs.com/package/sentence-splitter

https://github.com/textlint-rule/sentence-splitter/issues/28#issuecomment-2110632032

Edge cases to test: poetry, quotation marks, and punctuation that make sentence boundaries hard to determine. Example: Alice in Wonderland (there are several editions; I think this one is useful for testing: https://www.gutenberg.org/ebooks/28885 )
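
For reference, a minimal sketch of what sentence splitting with the native `Intl.Segmenter` API could look like. The helper name `splitSentences` and the default locale are assumptions made for illustration; this is not code from Thorium or r2-navigator-js, and the actual boundaries depend on the runtime's ICU data, which is why the edge cases above need testing.

```typescript
// Hypothetical helper (not from Thorium / r2-navigator-js): splits text into
// sentences using the built-in Intl.Segmenter with "sentence" granularity.
function splitSentences(text: string, locale: string = "en"): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const sentences: string[] = [];
  for (const { segment } of segmenter.segment(text)) {
    // Segments keep their trailing whitespace; trim it and skip empty pieces.
    const trimmed = segment.trim();
    if (trimmed.length > 0) {
      sentences.push(trimmed);
    }
  }
  return sentences;
}

// Simple usage; passages with nested quotes or verse should be checked by hand.
const sample = "Hello world. How are you? Fine!";
console.log(splitSentences(sample));
```

Passing the publication's language as `locale` (rather than a fixed default) is probably what a real integration would want, since segmentation rules can be locale-sensitive.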

danielweck commented 1 month ago

A good test for large sections of text (which would normally result in far-too-long speech utterances, and would therefore benefit from sentence detection) is the Georgia sample: https://idpf.github.io/epub3-samples/30/samples.html#georgia
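
To illustrate the utterance-length point, here is a rough sketch that breaks one long block of text into sentence-sized ranges while keeping character offsets, so each range could become its own utterance. The `SentenceRange` shape and the `sentenceRanges` name are hypothetical, the input is synthetic rather than the Georgia sample, and this does not reflect how the navigator currently tracks text positions.

```typescript
// Hypothetical shape: one speakable sentence plus its UTF-16 offsets in the source text.
interface SentenceRange {
  start: number;
  end: number;
  text: string;
}

// Hypothetical helper: sentence ranges computed with the built-in Intl.Segmenter.
function sentenceRanges(text: string, locale: string = "en"): SentenceRange[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const result: SentenceRange[] = [];
  for (const { segment, index } of segmenter.segment(text)) {
    // Skip whitespace-only segments; they would produce empty utterances.
    if (segment.trim().length === 0) {
      continue;
    }
    result.push({ start: index, end: index + segment.length, text: segment });
  }
  return result;
}

// Synthetic long paragraph standing in for a large section of a book.
const longParagraph = "It was a long road. ".repeat(50);
const paragraphRanges = sentenceRanges(longParagraph);
const longestRange = Math.max(...paragraphRanges.map((r) => r.end - r.start));
console.log(
  `${longParagraph.length} chars -> ${paragraphRanges.length} ranges, longest ${longestRange} chars`
);
```

Keeping offsets instead of trimmed strings is a design choice made here on the assumption that highlighting during playback needs to map each utterance back to its position in the original text.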

danielweck commented 1 month ago

Navigator code reference: https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L8

https://github.com/readium/r2-navigator-js/blob/91482324fa2313c4536c48693eb091464a483071/src/electron/renderer/common/dom-text-utils.ts#L935-L978