burchim / AVEC

[WACV 2023] Audio-Visual Efficient Conformer (AVEC) for Robust Speech Recognition
https://openaccess.thecvf.com/content/WACV2023/html/Burchi_Audio-Visual_Efficient_Conformer_for_Robust_Speech_Recognition_WACV_2023_paper.html
Apache License 2.0

Like the Transcription Demo #13

Open bakhuiyong opened 6 months ago

bakhuiyong commented 6 months ago

Hello. Thanks for sharing good code.

I would like to convert the speech of a video into text in real time, like the Transcription Demo you posted.

How should I use the code you shared to reproduce the Transcription Demo?

burchim commented 4 months ago

Hi, sorry for the delay...

I created the demo by converting the network predictions to a caption file in the ".srt" format. The captioned video can then be created by giving the caption file and the video to any online video captioning website, like this one.

The only problem is that I can no longer find the piece of code I was using to convert predictions to a .srt file. The file contains the list of predicted words and their timestamps. The timestamps can be recovered because the CTC predictions are time-aligned.
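For anyone who wants to rebuild this step, here is a minimal sketch of how such a conversion could look. It is not the original script, which is lost. It assumes you already have one greedy CTC token per output frame and that tokens are single characters; with a subword tokenizer the word-boundary logic would need adapting. The names `ctc_frames_to_words`, `words_to_srt`, `frame_duration`, and the blank symbol `"_"` are hypothetical and not part of the AVEC code base.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def ctc_frames_to_words(frame_tokens, frame_duration, blank="_", space=" "):
    """Collapse per-frame CTC tokens into (word, start_sec, end_sec) tuples.

    frame_tokens: one token per output frame (assumed character-level).
    frame_duration: seconds covered by one output frame (assumed known).
    """
    words = []
    chars, start, end = [], None, None
    prev = blank
    for i, tok in enumerate(frame_tokens):
        if tok != blank and tok != prev:
            # A new emission: either a word separator or a new character.
            if tok == space:
                if chars:
                    words.append(("".join(chars), start, end))
                chars, start, end = [], None, None
            else:
                if start is None:
                    start = i * frame_duration
                chars.append(tok)
                end = (i + 1) * frame_duration
        elif tok == prev and tok not in (blank, space) and chars:
            # A repeated frame extends the current character in time.
            end = (i + 1) * frame_duration
        prev = tok
    if chars:
        words.append(("".join(chars), start, end))
    return words


def words_to_srt(words, max_words_per_cue=7):
    """Group timed words into numbered SRT cues and return the file content."""
    cues = []
    starts = range(0, len(words), max_words_per_cue)
    for number, i in enumerate(starts, start=1):
        group = words[i:i + max_words_per_cue]
        cue_start = format_timestamp(group[0][1])
        cue_end = format_timestamp(group[-1][2])
        text = " ".join(word for word, _, _ in group)
        cues.append(f"{number}\n{cue_start} --> {cue_end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    # Toy example: 10 frames of 40 ms each (values are illustrative only).
    frames = ["_", "h", "h", "i", "_", " ", "y", "o", "_", "_"]
    timed_words = ctc_frames_to_words(frames, frame_duration=0.04)
    print(words_to_srt(timed_words))
```

The resulting string can be written to a `.srt` file and passed, together with the video, to a captioning tool. The cue length (`max_words_per_cue`) is an arbitrary choice here; splitting on pauses between words would be a reasonable alternative.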