pietrop closed this issue 5 years ago
So it would be good to test this out more extensively, but also to try out the "wave form comparison" approach, eg WebAligner - similar to Aeneas but perhaps possible client side (?).
I've tried with the same audio and text from the Ted Talk used in the demo app.
[x] Got the accurate transcription from the ted talk website ted-talk-kate.txt
[x] Kaldi Transcription of Ted Talk - KateDarling_2018S-bbc-kaldi.json
[x] And this is the re-aligned json ted-talk-kate-realigned.json.txt (github won't let me upload a json so you have to download and remove the .txt extension)
TL;DR
You can try loading the re-aligned json (ted-talk-kate-realigned.json) in the demo app and clicking on words across the text to see the overall quality of the alignment. It's not bad, but it's not 100%. It might be good enough (?).
There might be some optimisation/tweaks possible, such as first aligning at sentence/line level (eg using Levenshtein distance), and then aligning within the sentence level, which might give even more accurate results.
I tried the other algo for aligning at sentence/line level (using Levenshtein distance), but didn't get to the point of aligning the words within the sentences, because the sentence-level alignment wasn't able to handle the Ted Talk example... needs more investigation. (hopefully open sourcing these algos in the new year)
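A minimal sketch of the sentence-matching idea in plain JavaScript (an assumed toy, not the actual algo mentioned above): the classic dynamic-programming Levenshtein distance, used to pick the closest STT sentence for each accurate sentence before aligning words within it.

```javascript
// Classic DP edit distance between two strings.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const d = Array.from({ length: m + 1 }, (_, i) =>
    Array.from({ length: n + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[m][n];
}

// For each accurate sentence, pick the index of the closest STT sentence.
function bestMatch(sentence, sttSentences) {
  let best = 0, bestDist = Infinity;
  sttSentences.forEach((s, i) => {
    const dist = levenshtein(sentence, s);
    if (dist < bestDist) { bestDist = dist; best = i; }
  });
  return best;
}
```

A real implementation would also need to handle sentences the STT missed entirely, which is likely where the Ted Talk example broke down.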
Just making a note that another drastically different option for preserving time-codes is to restrict the edit only within word boundaries.
Similar to how @chrisbaume had done in bbc/dialogger.
Also similar to the earlier bbc/transcript-editor by @alexnorton - see the demo and choose the decorator option `withWords`.
This could be done with draft-js by making the entities' mutability `MUTABLE` (I think this is already in place) but disallowing insertion/editing of text outside of an entity.
The only issue with this approach is: what happens if you delete a whole paragraph and start writing it again from scratch?
Another option via @Laurian from the BBC/subtitalizer project
Each word is an entity, so:
1. if you edit within a word all is fine
2. if you split a word, you have a space inside an entity, so you can split the entity data into 2 words
3. if you join a word, you have entities with no space in between, you can join into a single one
4. if you have text without an entity range around, that's new typed stuff, you can recompute/average what that data might be
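One way the "split a word" case could be handled, sketched in plain JavaScript (a hypothetical helper, not subtitalizer's code): when an entity's text now contains a space, split its timing data into two words, proportionally to character length.

```javascript
// Split one word entity { text, start, end } into two,
// dividing the duration in proportion to the characters on each side.
function splitEntity(entity) {
  const [left, right] = entity.text.split(' ');
  const duration = entity.end - entity.start;
  const cut = entity.start + duration * (left.length / (left.length + right.length));
  return [
    { text: left, start: entity.start, end: cut },
    { text: right, start: cut, end: entity.end },
  ];
}
```

The join case is the inverse: take the start of the first entity and the end of the second.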
In subtitalizer, since only the start/end of a paragraph (caption item) matter, I always do 4 in this way:
- split into words
- recompute/estimate based on word length vs paragraph duration
- recreate new entities for the words in that para
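The three steps above could be sketched like this (a hypothetical helper, not subtitalizer's actual code): the paragraph's duration is spread across its words in proportion to their character length, and a new entity is produced per word.

```javascript
// Estimate per-word timings for a paragraph's text, given only the
// paragraph's start and end times.
function estimateWordTimings(text, paraStart, paraEnd) {
  const words = text.split(/\s+/).filter(Boolean);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  const duration = paraEnd - paraStart;
  let t = paraStart;
  return words.map((word) => {
    // each word gets a share of the duration proportional to its length
    const end = t + duration * (word.length / totalChars);
    const entity = { word, start: t, end };
    t = end;
    return entity;
  });
}
```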
how do you handle edge cases, when someone deletes a chunk of text, like a whole paragraph or parts of it?
if a block of text is deleted, the timing per para is always computed from the first and last entity; that deals well with joining and splitting paragraphs too
https://github.com/bbc/subtitalizer/blob/master/src/components/TranscriptEditor.js#L597
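That idea can be sketched as (a hypothetical helper, not the linked code):

```javascript
// Derive a paragraph's timing from its first and last word entities.
// This stays correct after splits, joins, and partial deletions,
// because it never relies on stored paragraph-level times.
function paragraphTiming(entities) {
  if (entities.length === 0) return null;
  return { start: entities[0].start, end: entities[entities.length - 1].end };
}
```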
in subtitalizer I trigger that on retiming the timecodes by hand, and on splitting/joining paragraphs only. split/join is easy to detect onChange, even by looking at the number of blocks. now if you want to do this onChange on every keystroke, better debounce it as it will slow you down
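A minimal debounce helper along those lines (an assumed implementation, not subtitalizer's; lodash's `debounce` would do the same job):

```javascript
// Return a wrapped fn that only runs after `waitMs` ms with no new calls,
// so retiming doesn't run on every keystroke.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Usage sketch: editor.onChange = debounce(retime, 300);
```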
`retime()` in subtitalizer averages timing data over the existing and new words in a paragraph. It uses the existing timing for the start/end of the para, or that can be supplied by hand when you change the timecode in the timecode widgets per para. Now you might just need this averaging to apply only if the averaged value is massively different for a word, so as to preserve existing timings.
Words can have a base duration for when one is very short, but it can also be computed if longer (etc... there's a bunch of different ways to do this).
`onChange` on every keystroke needs debouncing to keep performance up.
To recap the options:
I think the web aligner will take more work than you make out. The current web aligner proof-of-concept just looks for gaps in the audio amplitude. A better approach would be to follow the same algorithm as aeneas: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITWORKS.md
I couldn't find any JS libraries that do Sakoe-Chiba Band DTW, so the above algorithms may be too slow to be practical. As such, they might have to be modified to use the Sakoe-Chiba Band approach.
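For reference, a band-constrained DTW can be sketched like this (an assumed toy over 1-D feature sequences, not aeneas's code): only cells within `band` of the diagonal are filled, bringing the cost down from O(n·m) toward O(n·band).

```javascript
// Dynamic time warping with a Sakoe-Chiba band constraint.
// a, b: arrays of numbers (e.g. per-frame MFCC energies); band: half-width
// of the diagonal corridor in which warping is allowed.
function dtwBand(a, b, band) {
  const n = a.length, m = b.length;
  const d = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  d[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    // centre of the band for row i, scaled to b's length
    const center = Math.round((i * m) / n);
    const lo = Math.max(1, center - band);
    const hi = Math.min(m, center + band);
    for (let j = lo; j <= hi; j++) {
      const cost = Math.abs(a[i - 1] - b[j - 1]);
      d[i][j] = cost + Math.min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]);
    }
  }
  return d[n][m];
}
```

A real aligner would run this over MFCC vectors (with a vector distance instead of `Math.abs`) and backtrack through `d` to recover the warping path, but the banding trick is the same.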
If it were up to me, I'd go for option 1. However, I would set it up so that it only aligns the corrected bits rather than the whole thing. I don't know much about Lambda, but spinning up an instance might take a few seconds, which would be too slow IMO.
addressed in https://github.com/bbc/react-transcript-editor/pull/144 by @murezzda
closing for now, as it's been added to https://github.com/bbc/react-transcript-editor/pull/175 and soon to be merged into master, pending @jamesdools review of https://github.com/bbc/react-transcript-editor/pull/144
I've tried the following experiment
[x] Take Chapter One of Moby Dick
[x] Get the LibriVox audio through STT, eg using BBC Kaldi, but you could use Gentle or some other STT service.
[x] Run through stt-align-node (open sourcing soon 🤞) (aligning algo via @chrisbaume) - this aligns a known accurate transcript to the output of a speech-to-text service, transposing the times of the words onto the accurate text.
[x] Got a re-aligned json file for Moby Dick Chapter 1.
[x] Modified the re-aligned json to be closer to kaldi output by replacing `word` with `punct`, to be able to re-import into the TranscriptEditor.
[x] Imported into TranscriptEditor, and checked the start, middle, and end of the piece, and it seems like the alignment is spot on 🙌
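The transposing idea can be sketched roughly like this (a hypothetical toy, much simpler than stt-align-node): walk the accurate words and the STT words together, copying the STT timings onto accurate words that match; non-matching accurate words get no timing here, where a real aligner would interpolate from their neighbours.

```javascript
// accurateWords: array of strings; sttWords: [{ word, start, end }, ...].
// Returns the accurate words with STT timings transposed onto them.
function transposeTimes(accurateWords, sttWords) {
  let j = 0;
  return accurateWords.map((word) => {
    // scan a few STT words ahead for a case-insensitive match
    for (let k = j; k < Math.min(j + 4, sttWords.length); k++) {
      if (sttWords[k].word.toLowerCase() === word.toLowerCase()) {
        j = k + 1;
        return { word, start: sttWords[k].start, end: sttWords[k].end };
      }
    }
    return { word, start: null, end: null };
  });
}
```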
Next step