Would it be possible to add an option to emit the transcribed output as timestamp-prefixed lines, or with some other mark or metadata indicating when each word occurs in the source media?
This is how I was thinking I could hack it, if there isn't any way to surface this from the lower-level implementation:
Split the audio into chunks of lineDuration seconds (where lineDuration is the number of seconds to elapse between each line, like 5 or 10).
Get the transcript for each of those chunks of audio.
To ensure no words get cut at a chunk boundary, also produce a transcript for the gapSpan seconds of audio on either side of each cut (where gapSpan is the amount of time within which we expect the transcription to become stable: I would guess something like four seconds would probably be fine).
If the middle of the seam transcript conflicts with the transcripts of the two adjacent chunks concatenated, replace the words at the end of the first line and the start of the second (in roughly balanced proportion) with the transcribed words from the seam.
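To make the idea concrete, here is a rough Python sketch of the bookkeeping only: computing the chunk spans, computing the seam windows around each cut, and prefixing each line with a timestamp. Everything in it is an assumption for illustration: the function names (chunk_spans, seam_spans, format_timestamp, timestamped_lines) are made up, and transcribe is a hypothetical callable taking (start, end) in seconds and returning text, standing in for whatever the real transcription call would be. It does not attempt the seam-merging step.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def chunk_spans(total_seconds: float, line_duration: float) -> List[Span]:
    """Split [0, total_seconds] into consecutive chunks of line_duration seconds."""
    if line_duration <= 0:
        raise ValueError("line_duration must be positive")
    spans: List[Span] = []
    start = 0.0
    while start < total_seconds:
        end = float(min(start + line_duration, total_seconds))
        spans.append((start, end))
        start = end
    return spans


def seam_spans(spans: List[Span], gap_span: float, total_seconds: float) -> List[Span]:
    """For each cut between two chunks, return a window of gap_span seconds on either side."""
    seams: List[Span] = []
    for _, boundary in spans[:-1]:  # the last chunk's end is not a cut
        seams.append((float(max(0.0, boundary - gap_span)),
                      float(min(total_seconds, boundary + gap_span))))
    return seams


def format_timestamp(seconds: float) -> str:
    """Render a time offset as HH:MM:SS (fractions of a second are dropped)."""
    whole = int(seconds)
    return f"{whole // 3600:02d}:{whole % 3600 // 60:02d}:{whole % 60:02d}"


def timestamped_lines(total_seconds: float, line_duration: float,
                      transcribe: Callable[[float, float], str]) -> List[str]:
    """Transcribe each chunk via the hypothetical transcribe(start, end) and prefix its start time."""
    return [f"[{format_timestamp(start)}] {transcribe(start, end)}"
            for start, end in chunk_spans(total_seconds, line_duration)]
```

With a 25-second clip, lineDuration of 10 and gapSpan of 4, this would give chunks at 0-10, 10-20 and 20-25 seconds, and seam windows at 6-14 and 16-24 seconds. The seam transcripts would then be compared against the ends of the neighbouring lines as described above.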