pietrop closed this issue 5 years ago
So it would be good to test this out more extensively, but also to try out the "wave form comparison" approach, eg WebAligner - similar to Aeneas but perhaps possible client side (?).
I've tried with the same audio and text from the Ted Talk used in the demo app.
[x] Got the accurate transcription from the ted talk website ted-talk-kate.txt
[x] Kaldi Transcription of Ted Talk - KateDarling_2018S-bbc-kaldi.json
[x] And this is the re-aligned json ted-talk-kate-realigned.json.txt (github won't let me upload a json so you have to download and remove the .txt extension)
TL;DR
You can try loading the re-aligned json (ted-talk-kate-realigned.json) in the demo app and clicking on words across the text to see the overall quality of the alignment. It's not bad, but it's not 100%. It might be good enough (?).
There might be some optimisation/tweaks possible, such as first aligning at sentence/line level (eg using Levenshtein distance), and then aligning within the sentence level, which might give even more accurate results.
I tried the other algo for aligning at sentence/line level (using Levenshtein distance), but didn't get to the point of aligning the words within the sentences, because the sentence-level alignment wasn't able to handle the Ted Talk example... needs more investigation. (hopefully open sourcing these algos in the new year)
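A minimal sketch of the sentence-matching idea in plain JavaScript (an assumed toy, not the actual algo mentioned above): the classic dynamic-programming Levenshtein distance, used to pick the closest STT sentence for each accurate sentence before aligning words within it.

```javascript
// Classic DP edit distance between two strings.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  const d = Array.from({ length: m + 1 }, (_, i) =>
    Array.from({ length: n + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[m][n];
}

// For each accurate sentence, pick the index of the closest STT sentence.
function bestMatch(sentence, sttSentences) {
  let best = 0, bestDist = Infinity;
  sttSentences.forEach((s, i) => {
    const dist = levenshtein(sentence, s);
    if (dist < bestDist) { bestDist = dist; best = i; }
  });
  return best;
}
```

A real implementation would also need to handle sentences the STT missed entirely, which is likely where the Ted Talk example broke down.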
Just making a note that another drastically different option for preserving time-codes is to restrict the edit only within word boundaries.
Similar to how @chrisbaume had done in bbc/dialogger.
Also similar to the earlier bbc/transcript-editor by @alexnorton - see the demo and choose the decorator option `withWords`.
This could be done with draft-js by making the entities' mutability `MUTABLE` (I think this is already in place) but disallowing insertion/editing of text outside of an entity.
The only issue with this approach is: what happens if you delete a whole paragraph and start writing it again from scratch?
Another option via @Laurian from the BBC/subtitalizer project
Each word is an entity, so:
1. if you edit within a word all is fine
2. if you split a word, you have a space inside an entity, so you can split the entity data into 2 words
3. if you join a word, you have entities with no space in between, you can join into a single one
4. if you have text without an entity range around, that's new typed stuff, you can recompute/average what that data might be
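One way the "split a word" case could be handled, sketched in plain JavaScript (a hypothetical helper, not subtitalizer's code): when an entity's text now contains a space, split its timing data into two words, proportionally to character length.

```javascript
// Split one word entity { text, start, end } into two,
// dividing the duration in proportion to the characters on each side.
function splitEntity(entity) {
  const [left, right] = entity.text.split(' ');
  const duration = entity.end - entity.start;
  const cut = entity.start + duration * (left.length / (left.length + right.length));
  return [
    { text: left, start: entity.start, end: cut },
    { text: right, start: cut, end: entity.end },
  ];
}
```

The join case is the inverse: take the start of the first entity and the end of the second.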
In subtitalizer, since only the start/end of a paragraph (caption item) matter, I always do 4 in this way:
- split into words
- recompute/estimate based on word length vs paragraph duration
- recreate new entities for the words in that para
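The three steps above could be sketched like this (a hypothetical helper, not subtitalizer's actual code): the paragraph's duration is spread across its words in proportion to their character length, and a new entity is produced per word.

```javascript
// Estimate per-word timings for a paragraph's text, given only the
// paragraph's start and end times.
function estimateWordTimings(text, paraStart, paraEnd) {
  const words = text.split(/\s+/).filter(Boolean);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  const duration = paraEnd - paraStart;
  let t = paraStart;
  return words.map((word) => {
    // each word gets a share of the duration proportional to its length
    const end = t + duration * (word.length / totalChars);
    const entity = { word, start: t, end };
    t = end;
    return entity;
  });
}
```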
how do you handle edge cases, when someone deletes a chunk of text, like a whole paragraph or parts of it?
if a block of text is deleted, the timing per para is always computed from the first and last entity; that deals well with joining and splitting paragraphs too
https://github.com/bbc/subtitalizer/blob/master/src/components/TranscriptEditor.js#L597
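That idea can be sketched as (a hypothetical helper, not the linked code):

```javascript
// Derive a paragraph's timing from its first and last word entities.
// This stays correct after splits, joins, and partial deletions,
// because it never relies on stored paragraph-level times.
function paragraphTiming(entities) {
  if (entities.length === 0) return null;
  return { start: entities[0].start, end: entities[entities.length - 1].end };
}
```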
in subtitalizer I trigger that on retiming the timecodes by hand, and on splitting/joining paragraphs only. split/join is easy to detect onChange, even by looking at the number of blocks. now if you want to do this onChange on every keystroke, better debounce it as it will slow you down
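A minimal debounce helper along those lines (an assumed implementation, not subtitalizer's; lodash's `debounce` would do the same job):

```javascript
// Return a wrapped fn that only runs after `waitMs` ms with no new calls,
// so retiming doesn't run on every keystroke.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Usage sketch: editor.onChange = debounce(retime, 300);
```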
`retime()` in subtitalizer averages timing data over the existing and new words in a paragraph. It uses the existing timing for the start/end of the para, or that can be supplied by hand when you change the timecode in the timecode widgets per para. Now you might just need this averaging to apply only if the averaged value is massively different for a word, so as to preserve existing timings.
Words can have a base duration for when one is very short, but it can also be computed if longer (etc... there's a bunch of different ways to do this).
`onChange` on every keystroke needs debouncing to keep performance up.
To recap the options:
I think the web aligner will take more work than you make out. The current web aligner proof-of-concept just looks for gaps in the audio amplitude. A better approach would be to follow the same algorithm as aeneas: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITWORKS.md
I couldn't find any JS libraries that do Sakoe-Chiba Band DTW, so the above algorithms may be too slow to be practical. As such, they might have to be modified to use the Sakoe-Chiba Band approach.
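For reference, a band-constrained DTW can be sketched like this (an assumed toy over 1-D feature sequences, not aeneas's code): only cells within `band` of the diagonal are filled, bringing the cost down from O(n·m) toward O(n·band).

```javascript
// Dynamic time warping with a Sakoe-Chiba band constraint.
// a, b: arrays of numbers (e.g. per-frame MFCC energies); band: half-width
// of the diagonal corridor in which warping is allowed.
function dtwBand(a, b, band) {
  const n = a.length, m = b.length;
  const d = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  d[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    // centre of the band for row i, scaled to b's length
    const center = Math.round((i * m) / n);
    const lo = Math.max(1, center - band);
    const hi = Math.min(m, center + band);
    for (let j = lo; j <= hi; j++) {
      const cost = Math.abs(a[i - 1] - b[j - 1]);
      d[i][j] = cost + Math.min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]);
    }
  }
  return d[n][m];
}
```

A real aligner would run this over MFCC vectors (with a vector distance instead of `Math.abs`) and backtrack through `d` to recover the warping path, but the banding trick is the same.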
If it were up to me, I'd go for option 1. However, I would set it up so that it only aligns the corrected bits rather than the whole thing. I don't know much about Lambda, but spinning up an instance might take a few seconds, which would be too slow IMO.
addressed in https://github.com/bbc/react-transcript-editor/pull/144 by @murezzda
closing for now, as it's been added to https://github.com/bbc/react-transcript-editor/pull/175 and soon to be merged into master, pending @jamesdools review of https://github.com/bbc/react-transcript-editor/pull/144
I've tried the following experiment
[x] Take Chapter One of Moby Dick
[x] Get the LibriVox audio through STT, eg using BBC Kaldi, but you could use Gentle or some other STT service.
[x] Run through stt-align-node (open sourcing soon 🤞) (aligning algo via @chrisbaume) - this aligns a known accurate transcript to the output of a speech-to-text service, transposing the times of the words onto the accurate text.
[x] Got a re-aligned json file for Moby Dick Chapter 1.
[x] Modified the re-aligned json to be closer to kaldi output by replacing `word` with `punct`, to be able to re-import into the TranscriptEditor.
[x] Imported into TranscriptEditor, and checked the start, middle, and end of the piece, and it seems like the alignment is spot on 🙌
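The transposing idea can be sketched roughly like this (a hypothetical toy, much simpler than stt-align-node): walk the accurate words and the STT words together, copying the STT timings onto accurate words that match; non-matching accurate words get no timing here, where a real aligner would interpolate from their neighbours.

```javascript
// accurateWords: array of strings; sttWords: [{ word, start, end }, ...].
// Returns the accurate words with STT timings transposed onto them.
function transposeTimes(accurateWords, sttWords) {
  let j = 0;
  return accurateWords.map((word) => {
    // scan a few STT words ahead for a case-insensitive match
    for (let k = j; k < Math.min(j + 4, sttWords.length); k++) {
      if (sttWords[k].word.toLowerCase() === word.toLowerCase()) {
        j = k + 1;
        return { word, start: sttWords[k].start, end: sttWords[k].end };
      }
    }
    return { word, start: null, end: null };
  });
}
```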
Next step