Cuperino / QPrompt-Teleprompter

Teleprompter software for all video creators. Built with ease of use, productivity, control accuracy, and smooth performance in mind.
https://qprompt.app
GNU General Public License v3.0

Export recording of prompted performance as subtitles or closed captions #22

Open HansVanNeck opened 2 years ago

HansVanNeck commented 2 years ago

Is your feature request related to a problem or a limitation? Please describe...

Export to a subtitle file. This is a file that can be uploaded to YouTube or burned into a video. It shows the text at the right moment. In Europe, especially in the small countries, nearly all professional videos are subtitled. It is easier for older people and people with hearing problems, and it has the added advantage that you can watch videos at the office without annoying your colleagues. For business, government, and professionals, it is rather impolite to publish a video that is not subtitled.

Describe the solution you'd prefer

First I want to describe the workflow. Our workflow, rather standard in the industry for the production of professional (sales, demonstration, instruction, education) videos, is as follows.

  1. Create the teleprompter file. This is done by an editor. It has very tight speed parameters. QPrompt is ideal because we can include not only the text but also directives in another font or color.
  2. An actor (the speaker) records a video.
  3. A video editor combines this video with additional nice video elements.
  4. The teleprompter file is converted to a subtitle file.
  5. The subtitle file is added by the video editor to the final video.

The problem is in step 4, because we do not have timing data. But QPrompt has that data somewhere.

The SRT file is a standard extension; the open-source standard tool for it is Subtitle Edit.
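
For reference, SRT is plain text: a sequence number, a time range in `HH:MM:SS,mmm` form, and one or two lines of text, separated by blank lines. A made-up sample:

```
1
00:00:01,200 --> 00:00:04,700
Welcome to this demonstration.

2
00:00:05,100 --> 00:00:08,300
Today we will look at the new features.
```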

Describe alternatives you've considered

We now export to txt, use Subtitle Edit to convert it to SRT, and upload it together with the teleprompter video to Google, which sets the timestamps. This does not always go well. After that, we download the subtitle file and continue with step 5. T ...

Provide use examples

Cuperino commented 2 years ago

I really like this suggestion! If anyone else wants to export subtitles, please leave your comments and suggestions here.

Cuperino commented 2 years ago

Implementation-wise, the greatest challenge is keeping timestamps in sync. It is not possible to precisely estimate timestamps when the user uses a small font with a relatively wide prompter that allows many words per line. That doesn't seem to be standard practice in studios, but people who record from webcams find it very convenient.
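
To illustrate the imprecision: with only the scroll settings to go on, the best you can do is something like the sketch below, where every word on the same visual line gets the same time window. The names and numbers are illustrative assumptions, not QPrompt internals.

```python
# Rough sketch: estimate when each wrapped line of prompter text crosses
# the reading region, using only scroll settings. With a small font on a
# wide prompter, one line holds many words, so the estimate degrades.

def estimate_line_times(lines, line_height_px=60, scroll_speed_px_s=120):
    """Return (start_s, end_s) for each visual line of prompter text."""
    times = []
    for i, _ in enumerate(lines):
        start = i * line_height_px / scroll_speed_px_s
        end = (i + 1) * line_height_px / scroll_speed_px_s
        times.append((start, end))
    return times
```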

I find that a speech-to-text (STT) conversion would be necessary to estimate timestamps precisely. The only advantage of using a teleprompter program to create subtitles would be that you'd be able to use the original text for the subtitles instead of the text produced by the STT conversion.

Wouldn't it be easier if you could give the edited video and its teleprompter file(s) to an intermediary or video-editor program that uses STT to match the voice with the text and generate a preliminary SRT file? The SRT could then be refined using an editor like Subtitle Edit or Aegisub before upload.

HansVanNeck commented 2 years ago

I do not expect precise synchronisation.
I expect precise text, with the important condition that sentences are not cut off in the middle. See below.

Speech-to-text conversion is nice for informal language. On Dutch television, if it is used (for instance at a live press conference), we first get a long warning that the subtitles are not correct. But for professional videos you cannot use it. And speech-to-text conversion in the Dutch language is even worse (laughably wrong; sometimes it does not recognize the Dutch word for "not", so you get the opposite meaning).

At the moment, we use Google. It gives 100% precise synchronisation, far better than the synchronisation we expect from a subtitle export. However, it often breaks a sentence in the middle and does not allow for multi-line subtitles. And 100% synchronization is not what we want. People are not robots.

Google will never be able to get it right, because you want the subtitle to present a full sentence. So you show the full sentence in the subtitle 0.2 seconds before the sentence is started, including the words that have not yet been spoken. And you keep it on the screen for 0.5 seconds after the sentence has ended. (There are two kinds of people: those who read faster and those who read slower; you must accommodate both.)
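
As a rough sketch of that rule (my own illustration, in Python; times in seconds):

```python
def pad_cue(start_s, end_s, lead=0.2, tail=0.5, prev_end=0.0):
    """Show the cue `lead` seconds before the sentence is spoken and keep
    it on screen `tail` seconds after it ends, without overlapping the
    previous cue."""
    return max(prev_end, start_s - lead), end_s + tail

# A sentence spoken from 12.0 s to 15.4 s is displayed from 11.8 s to 15.9 s.
print(pad_cue(12.0, 15.4))
```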

So we must edit it anyhow, and check very precisely... And that is a lot of work, because we need to recombine the subtitles. With an exported file, we would only have to move each one a little. And with typically 100 subtitles per file, that is not much work.

What I forgot in my requirement is the ability to detect the start of a new line: a hard new line, versus a new line that exists only because the length of the subtitle is exceeded (a soft new line). In the subtitle, these have the effect of a new subtitle, or of a new line within a subtitle.
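
In code terms, the distinction could look like this sketch (the 42-character, 2-line limits are just common subtitle conventions, not part of my request):

```python
import textwrap

def split_cues(script, max_chars=42, max_lines=2):
    """Hard new lines in the script start a new subtitle; soft new lines
    from the length limit become extra lines within the same subtitle,
    spilling into a new subtitle once `max_lines` is exceeded."""
    cues = []
    for paragraph in script.split("\n"):               # hard new line
        wrapped = textwrap.wrap(paragraph, max_chars)  # soft new lines
        for i in range(0, len(wrapped), max_lines):
            cues.append("\n".join(wrapped[i:i + max_lines]))
    return cues
```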

Cuperino commented 2 years ago

The problem with Google's model is that it can't accurately determine the start or end of a sentence, so instead they compromise on punctuation marks and subtitles end up becoming a stream of words; at least that's my understanding of what is done for videos on YouTube.

Having the original text as reference is a great advantage here. Nevertheless, QPrompt's ability to time the start and end of sentences is bounded by the imprecision of how many words fit in a line, and how fast that line is scrolled past. This is why something like STT is needed for better synchronization. Subtitles exported solely from settings would make for a frustrating experience, because of the amount of trial and error users would have to do to match the recording.

Precise text could be achieved by copying the teleprompter text into the subtitle file after fuzzily matching it against an STT conversion. The fuzzy match would be done using characters or phonemes instead of words for better accuracy (using a dictionary against segmented words, this could later be made to work with logographic writing systems as well). The program would then use punctuation marks from the original text to determine the start and end positions of individual subtitles, and match those positions with the more precise time codes from the STT conversion. Add or subtract the 0.2 and 0.5 second paddings as applicable and you'd get very readable subtitles. The text would match what was used for the teleprompter, and synchronization would be usable and much closer to Google's.
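
A minimal sketch of that matching step, assuming the STT engine already provides per-word timestamps (character-level comparison via Python's difflib stands in for the phoneme matching described above):

```python
import re
from difflib import SequenceMatcher

def norm(s):
    """Normalize to bare characters; phonemes would be more accurate."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def align(script, stt_words):
    """stt_words: list of (word, start_s, end_s) from any STT engine.
    Split the script into sentences at punctuation marks, then greedily
    find, for each sentence, the run of STT words that matches it best."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    cues, i = [], 0
    for sent in sentences:
        if i >= len(stt_words):
            break
        target = norm(sent)
        best_j, best_score = i + 1, -1.0
        for j in range(i + 1, len(stt_words) + 1):
            cand = norm("".join(w for w, _, _ in stt_words[i:j]))
            score = SequenceMatcher(None, target, cand).ratio()
            if score > best_score:
                best_j, best_score = j, score
            if len(cand) > 2 * len(target) + 4:  # far past any plausible match
                break
        cues.append((sent, stt_words[i][1], stt_words[best_j - 1][2]))
        i = best_j
    return cues
```

Each resulting (sentence, start, end) cue would then get the 0.2/0.5 paddings and be written out as SRT.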

The big question is whether this should be a separate program or a feature in QPrompt. I see this more as a separate program that QPrompt can communicate with. It would be used as a standalone program to match text against edited recordings, or it could be connected to QPrompt for live use. QPrompt would then indicate which lines to use as reference and push any changes to the script using a protocol like MOS. The program would perform a live conversion, word by word, to generate closed captions from the teleprompter text.
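
For the live case, the word-by-word conversion could be as simple as this sketch (32 characters is the classic CEA-608 caption line width; the MOS plumbing is out of scope here):

```python
class RollUpCaptioner:
    """Turn teleprompter words into roll-up caption lines as they scroll
    past the read marker. A stand-in for what the companion program
    might do; no MOS or network code."""

    def __init__(self, max_chars=32):
        self.max_chars = max_chars
        self.line = ""

    def feed(self, word):
        """Called for each word passing the read marker; returns the
        caption line to display."""
        if self.line and len(self.line) + 1 + len(word) > self.max_chars:
            self.line = word            # roll up: start a fresh line
        else:
            self.line = (self.line + " " + word).strip()
        return self.line
```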

You'd get full sentence subtitles for edited recordings and precise closed captions for live streams.

HansVanNeck commented 2 years ago

Thank you. But we will always need manual adjustment, and that adjustment is no work at all. I will explain. The 0.2 seconds before and 0.5 seconds after that I mentioned are not the same for every sentence. They depend on what you see in the video: the speed pattern of the presenter, the emotion, the emphasis, the visual expression of the presenter (face, moving arms, etc.).
"Now you see" must start at 0.0 seconds, and stay available for 2 seconds after it in the video. And "As you would think" in a questioning tone needs 3 seconds, and in a non-questioning tone 1 second. And if the presenter also shows his open hands, or points, the subtitle must stay until the movement is finished.
So whatever you do, it is never good enough, and we must adjust it.

Editing subtitles is a lot of work, and it is easy to make mistakes.
But moving subtitles is no work at all. Why? When you are editing a video, you see the audio waveform under the video, so you can see exactly where a sentence starts and where it ends. And above it is the subtitle, which you are sure is within 6 seconds of the right position. Why 6 seconds? That is the degree of freedom the presenter has within a script.