audiobook: support reading from audio files in TTS

NOTE: this is a working audiobook impl, but it is FAR from polished, and it is NOT plug-and-play. i am already using it all the time, though, so i figured i'd stick it here in case someone else is interested.

FEATURES

supports reading from mp3/ogg/flac/wav files, instead of android TTS engines, inside the TTS ReadAloud Module
supports sentence navigation, pause/play/next/prev, etc
moves the selected sentence as playback progresses, and plays from the selected sentence
does not assume audiobook and ebook are identical, makes best guesses and moves along
- even large missing or additional sections generally work fine
- if the audiobook plays a section missing from the ebook, the visual sentence selector waits for the audio to catch up
- if the ebook has a section missing from the audiobook, the visual sentence selector skips a sentence every half second until it catches up.
does not skip over audiobook intros/music etc. the full audiobook is played, and the place in the book moves along while the content is read

USAGE

requires creating a .wordtiming text file for each e-book OUTSIDE OF COOLREADER
- this file can be generated from any speech-to-text system like vosk that supports per-word timings
- ebook-audiobook-wordtiming is a script for generating this using vosk, pandoc, gnu-diff, python, + perl
- see:
- https://github.com/teleshoes/ebook-audiobook-wordtiming (for vosk-words-json and ebook-audiobook-wordtiming)
- https://github.com/alphacep/vosk-api
- https://pypi.org/project/vosk/
once created, place the wordtiming file, ebook, and audiobook mp3/flac/etc files in the same directory
open the ebook as normal, and start TTS. if there is a wordtiming file, the audiobook is used

EXAMPLE

attached is A Christmas Carol, by Charles Dickens, from Project Gutenberg and Librivox (all works in the public domain)
- https://drive.google.com/drive/folders/1abepyfOW9on94tiZpoEdN4QBYuud_jk4?usp=sharing
- includes: e-book (.txt), audiobook (.flac), and wordtiming file (*.wordtiming)
download all files, and extract the zip file to any directory on the device

wordtiming file was generated with this script:

ebook-audiobook-wordtiming \
a_christmas_carol_charles_dickens_project_gutenberg_19337.txt \
A_Christmas_Carol*.flac \
-o a_christmas_carol_charles_dickens_project_gutenberg_19337.wordtiming

perl

P1) vosk-timing-data - run vosk-words-json on WAV files, get statistics on each word
- output is a big JSON file
- this is the only CPU-intensive/long-running step
- the result is LZMA compressed and cached
- the cache can be generated with or without an ebook, for processing later
- (i plan on running this on all audiobooks as i acquire them)
P2) audio-word-timing - process vosk-timing-data into a CSV with three columns: AUDIO_WORD,START_TIME_SECONDS,AUDIO_FILE
- this is the start time of each word in the AUDIOBOOK
P3) audio-word-list - make a copy of audio-word-timing and remove the START_TIME_SECONDS column
P4) ebook-word-list - process the EPUB/FB2/TXT file into a list of words (one word per line) with pandoc
- this step tries to apply the same rules that coolreader will use later
P5) ebook-audio-diff - align audio-word-list and ebook-word-list
- this is the ONLY STEP that compares the ebook to the audiobook
- this step handles bad spelling, bad pronunciation, proper nouns, missing passages, extra words, skipped sentences, footnotes read in line instead of at the end of the chapter, EVERYTHING that is different between the audiobook and the ebook
- it uses the Myers diff algorithm, finding the longest common subsequence
- i.e.: it's literally just diff -y
P6) ebook-word-timing - combine ebook-word-list, ebook-audio-diff, and audio-word-timing to get ebook timing
- take ebook-word-list and add two columns, START_TIME_SECONDS and AUDIO_FILE
- using ebook-audio-diff, find the highest index of the word in audio-word-list that is not part of the longest-common-subsequence of a later word
- take that index and get the timing from audio-word-timing, and fill in START_TIME_SECONDS/AUDIO_FILE columns with it
- after this point, the words in the audiobook are not used and never appear again
- you will never see mispronounced words, spoken errors, etc. every word from here on out appears, in order, in the FB2/EPUB/TXT
- this is the start time of each word in the EBOOK
- this is the contents of *.wordtiming, and is the final output of the perl script
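
a rough python sketch of the ebook-audio-diff and ebook-word-timing steps, for illustration only. it is NOT the perl script: it swaps in python's difflib.SequenceMatcher for diff -y (difflib finds matching blocks in order, but is not the Myers algorithm and does not guarantee the longest common subsequence), and it simplifies the timing rule so that an ebook word with no audio match just inherits the timing of the nearest earlier matched word. the function name and tuple layout are hypothetical.

```python
from difflib import SequenceMatcher

def ebook_word_timing(ebook_words, audio_timing):
    """Attach a start time and audio file to every ebook word.

    ebook_words:  list of words, in ebook order
    audio_timing: list of (word, start_seconds, audio_file), in audio order
    """
    audio_words = [word for word, _, _ in audio_timing]
    matcher = SequenceMatcher(None, ebook_words, audio_words, autojunk=False)

    # map each matched ebook index to its audio index
    matched = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            matched[block.a + k] = block.b + k

    rows = []
    # assumption: words before the first match point at the start of the first file
    last = (0.0, audio_timing[0][2]) if audio_timing else (0.0, "")
    for i, word in enumerate(ebook_words):
        if i in matched:
            _, start, audio_file = audio_timing[matched[i]]
            last = (start, audio_file)
        # unmatched ebook words reuse the previous timing (simplification)
        rows.append((word, last[0], last[1]))
    return rows

ebook = ["marley", "was", "dead", "to", "begin", "with"]
audio = [
    ("chapter", 0.0, "ch1.flac"), ("one", 0.5, "ch1.flac"),
    ("marley", 1.0, "ch1.flac"), ("was", 1.4, "ch1.flac"),
    ("dead", 1.7, "ch1.flac"), ("begin", 2.5, "ch1.flac"),
    ("with", 2.9, "ch1.flac"),
]
rows = ebook_word_timing(ebook, audio)
```

in this example the audio has an extra "chapter one" and is missing "to". every ebook word still gets a row, and the audio-only words never appear in the output.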

coolreader

CR1) sentence-info - when audiobook-tts starts, navigate to each sentence and get info
- jump to page 0
- select the first sentence on the page
- select the next sentence, repeatedly, until there are no sentences left in the book
- after selecting each sentence, record the sentence-text and the dom-start-pos
- this is done in CPP, invoked from a JNI file, and sent to java as a List<SentenceInfo>
- this is turned into an (improper) CSV with two columns, START_POS and TEXT
- start pos is a DOM id, from ldomXPointerEx->toString() (it never has any commas)
- TEXT is allowed to contain commas, because it's the last column (hence, this is an improper CSV)
- e.g.: /text/p[45].135, having little or no money in my pocket
- java caches this sentence info in a file, *.sentenceinfo, if coolreader has write perms where the ebook is
- this step is the only slow part in coolreader
- it is purely the coolreader sentence structure parsing
- it has NOTHING to do with audiobooks, or wordtimings, or anything
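
to illustrate the improper CSV described for *.sentenceinfo, here is a minimal python sketch of reading one line back. it assumes only what is stated above: START_POS never contains a comma, so everything after the first comma is TEXT. the function name and the sample line are made up, and the real parsing lives in coolreader's java code.

```python
def parse_sentence_info_line(line):
    """Split one sentenceinfo line into (start_pos, text).

    Only the FIRST comma is a separator; later commas belong to the text.
    """
    start_pos, text = line.rstrip("\n").split(",", 1)
    return start_pos, text

# hypothetical sample line, in the START_POS,TEXT layout
pos, text = parse_sentence_info_line("/body/p[2].0,Marley was dead, to begin with.\n")
```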
CR2) sentence-words - parse each sentence into a list of words, as close to step P4) as possible
- it's never exactly the same, because pandoc is not coolreader, but the difference is SMALL
CR3) sentence-start-times - compare sentence-words to ebook-word-timing file
- load *.wordtiming, parse into a list of word/start-time pairs
- for each sentence, take each word in that sentence and try to apply to the next word in wordtiming
- a sentence must match EVERY SINGLE WORD in sentence-words to a word in wordtiming
- however, allow skipping up to 20 WORDTIMING words (this limit is arbitrary, but if it's too large, whole passages could be skipped)
- if a sentence does not match every word (happens all the time), consume the words that did match, use the lowest start time matched, and continue to the next sentence
- the output here is a start TIME and start PHYSICAL-POSITION for every sentence (this is the final pre-processing goal)
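
the sentence-start-times matching can be sketched roughly like this in python. this is a guess at the shape of the logic, not the java code: it assumes the 20-word skip limit applies to each word lookup, it ignores AUDIO_FILE, and it returns None for a sentence where no word matched (the description above does not say what happens in that case). all names are hypothetical.

```python
MAX_SKIP = 20  # the arbitrary skip limit mentioned above

def sentence_start_times(sentences, wordtiming, max_skip=MAX_SKIP):
    """Return one start time (or None) per sentence.

    sentences:  list of sentences, each a list of words
    wordtiming: list of (word, start_seconds), in ebook order
    """
    pos = 0  # next unconsumed wordtiming index
    starts = []
    for words in sentences:
        times = []
        for word in words:
            # look at the next word, plus up to max_skip words after it
            window_end = min(pos + max_skip + 1, len(wordtiming))
            for j in range(pos, window_end):
                if wordtiming[j][0] == word:
                    times.append(wordtiming[j][1])
                    pos = j + 1  # consume everything up to the match
                    break
            # no match: keep pos where it is and try the next word
        starts.append(min(times) if times else None)
    return starts

wordtiming = [
    ("stave", 0.0), ("one", 0.4), ("marley", 1.0), ("was", 1.4),
    ("dead", 1.7), ("to", 2.0), ("begin", 2.2), ("with", 2.5),
    ("there", 3.1), ("is", 3.3), ("no", 3.5), ("doubt", 3.7),
]
sentences = [
    ["marley", "was", "dead", "to", "begin", "with"],
    ["there", "is", "no", "doubt", "whatever"],
]
starts = sentence_start_times(sentences, wordtiming)
```

the second sentence ends with a word that has no timing, so it only partially matches, but it still gets the lowest start time of the words that did match.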
CR4) start-playback - play audiobook instead of TTS
- when tts-start is invoked, get the DOM location of the start of the initially selected sentence
- select the closest sentence in sentence-start-times
- open the audio file, and seek to the start time
- NOTE: if this is the first sentence in the audio file, ALWAYS seek to 0
- this way, you hear the full audiobook
- e.g.: intro music, copyright notices, "Start of CD Number Three", etc
CR5) continue-playback - select the next sentence as audiobook position continues
- do not stop playback, ever, without user interaction (this way, you get to hear the end of each audiobook file)
- when media playback time is after the next sentence in sentence-start-times, select the next sentence as if user clicked Next >>
- when media playback STOPS, and the next sentence is the FIRST sentence of a new audio file, start the next audio file
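
a small python sketch of the continue-playback selection rule, assuming a single audio file and a periodic check of the playback position (the features list mentions one skip every half second). it is a simplified illustration with hypothetical names, not the java implementation.

```python
def next_selection(current, starts, playback_seconds):
    """Return the sentence index to select on this tick.

    current:          index of the currently selected sentence
    starts:           start time in seconds of each sentence, in order
    playback_seconds: current position in the audio file
    """
    nxt = current + 1
    # advance at most ONE sentence per tick, like a single press of Next >>
    if nxt < len(starts) and playback_seconds > starts[nxt]:
        return nxt
    # audio has not reached the next sentence yet: stay put
    return current

starts = [0.0, 4.2, 9.8]
```

called once per tick, this gives both behaviours from the features list: when the audio is behind, the selection waits; when the audio is far ahead, the selection steps forward one sentence per tick until it catches up.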

buggins / coolreader

audiobook: support reading from audio files in TTS #353
