Suggestion: integrate as an app in Google Drive itself

Manamama commented 1 year ago

Thank you for creating this and especially the extensive documentation. (I had started to cobble something similar, which works, but then gave up on it given a couple of more elegant competing options, including yours.)

Just a tip: maybe create it as a Google Drive app and publish it on their Marketplace for even easier integration for end users? (I wonder whether they would accept it, as it is quite "Colab-intensive"...)

Manamama commented 1 year ago

Another suggestion: see e.g. this sample: https://www.youtube.com/watch?v=JEHyXDZTDK4 It starts in Polish and then switches to the original Russian, with manually curated, hard-coded Polish subtitles. With Whisper's default language detection, which settles on Polish, poor Whisper valiantly tries to transcribe the subsequent Russian as Polish, and actually makes a couple of correct guesses. The best results come, of course, from the "translate" option, which is meant for such cases (see the sample below), but I wonder how to make it switch between languages mid-stream. My initial (and naive) proposal:

1. Sample it randomly at every minute interval to detect the language of a segment.
2. Collect the results into a table, maybe using a basic statistical tool.
3. Trim and recognize these segments, or, better:
4. Rerun recognition with all the detected languages fed as a parameter, taken from Step 2.
5. Collate the results somehow into a multilingual SRT file or even HTML (see parallel texts format), so that the end user may choose them when watching the video, analyzing the ready parallel translations, etc.
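
To make the first three steps concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `detect` is a hypothetical callable that returns a language code for the audio starting at a given second (it could, for instance, wrap Whisper's own language detection on a short clip), and the probes sit at the start of each minute rather than at random offsets, to keep the example deterministic. All function names are made up for this example.

```python
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[float, str]            # (probe start in seconds, language code)
Segment = Tuple[float, float, str]    # (start, end, language code)


def sample_times(duration_s: float, step_s: float = 60.0) -> List[float]:
    """Return one probe start time per step_s across the recording."""
    times = []
    t = 0.0
    while t < duration_s:
        times.append(t)
        t += step_s
    return times


def detect_per_window(duration_s: float,
                      detect: Callable[[float], str],
                      step_s: float = 60.0) -> List[Sample]:
    """Step 1: ask the (hypothetical) detector for a language at each probe."""
    return [(t, detect(t)) for t in sample_times(duration_s, step_s)]


def language_table(samples: List[Sample]) -> Counter:
    """Step 2: count how many probes were assigned to each language."""
    return Counter(lang for _, lang in samples)


def merge_runs(samples: List[Sample], duration_s: float) -> List[Segment]:
    """Step 3: merge consecutive probes with the same language into segments."""
    segments: List[Segment] = []
    for start, lang in samples:
        if segments and segments[-1][2] == lang:
            continue  # same language as the running segment, keep extending it
        if segments:
            prev_start, _, prev_lang = segments[-1]
            segments[-1] = (prev_start, start, prev_lang)  # close previous run
        segments.append((start, duration_s, lang))
    return segments


if __name__ == "__main__":
    # Stand-in detector: Polish for the first two minutes, Russian afterwards.
    fake_detect = lambda t: "pl" if t < 120 else "ru"
    probes = detect_per_window(240.0, fake_detect)
    print(language_table(probes))
    print(merge_runs(probes, 240.0))
```

Each resulting segment could then be trimmed and transcribed with its own language setting. Segment boundaries are only as precise as the probe spacing, so a real implementation would probably need to refine them.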

[05:16.000 --> 05:27.000] And it sounds terrible to us, that if you know yourself and you know someone else's army, then you will meet a hundred times and you will win a hundred times.
[05:27.000 --> 05:35.000] If you know yourself and you do not know him, then once you win, you will win them.
[05:35.000 --> 05:40.000] If you do not know yourself and you do not know them, then you will meet a hundred times and you will lose a hundred times.
[05:40.000 --> 05:45.000] So that the third scenario does not happen, I have a question, do we know them?
[05:45.000 --> 05:48.000] Do we study them?
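
As a small step toward the SRT idea, transcript lines in the `[MM:SS.mmm --> MM:SS.mmm] text` format shown above can be turned into SRT cues roughly as follows. This is a sketch only: the helper names are invented for the example, it handles just the minutes-and-seconds timestamp form seen in this sample, and it produces a single-language track. Collating several languages into one file is left open.

```python
import re
from typing import List, Tuple

# Matches lines such as "[05:16.000 --> 05:27.000] some text".
LINE_RE = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)


def parse_line(line: str) -> Tuple[int, int, str]:
    """Return (start_ms, end_ms, text) for one transcript line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized transcript line: {line!r}")
    m1, s1, ms1, m2, s2, ms2, text = match.groups()
    start_ms = (int(m1) * 60 + int(s1)) * 1000 + int(ms1)
    end_ms = (int(m2) * 60 + int(s2)) * 1000 + int(ms2)
    return start_ms, end_ms, text


def srt_time(total_ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def to_srt(lines: List[str]) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    blocks = []
    for index, line in enumerate(lines, start=1):
        start_ms, end_ms, text = parse_line(line)
        blocks.append(
            f"{index}\n{srt_time(start_ms)} --> {srt_time(end_ms)}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        "[05:40.000 --> 05:45.000] So that the third scenario does not happen, "
        "I have a question, do we know them?",
        "[05:45.000 --> 05:48.000] Do we study them?",
    ]))
```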

ArthurFDLR commented 1 year ago

Hey, thanks for your input! Creating a proper Web Application would require way more work than I'm willing to put into this project. But the idea is very interesting. Also, I agree that optionally enforcing a more refined language detection would be very interesting, but this particular repository only focuses on the high-level implementation of the Colab interface. It would be best if you opened a discussion on the actual repository of the Whisper project about your suggestion.

ArthurFDLR / whisper-youtube

Suggestion: integrate as an app in Google Drive itself #1