Add options for handling multilingual input

jsichi commented 7 months ago

This is an incomplete PR, intended as a start on addressing issues loosely related to #184.

The multilingual_input option controls whether multiple languages should be expected in the input stream. If False (the backwards-compatible default), only one language is expected: either the one specified by the client or, if the client specified none, the first one heard. If True, the language can change throughout the stream, and transcription will produce multilingual text. A notification is sent to the client whenever a language change is detected. If the pauses between utterances in different languages are not long enough, the transcript boundaries may be incorrect, i.e. the first sentence in the new language may be transcribed as if it were still in the previous language. This currently seems unavoidable because of the way the last work-in-progress segment gets reprocessed.
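To make the two modes concrete, here is a minimal sketch of the language-tracking logic described above. It is not the code in this PR: `track_languages`, its parameters, and the notification tuples are hypothetical names chosen for illustration, and it assumes a language has already been detected for each segment.

```python
from typing import List, Optional, Tuple

# Hypothetical notification shape: (segment index, previous language, new language).
Notification = Tuple[int, str, str]


def track_languages(
    detected: List[str],
    client_language: Optional[str] = None,
    multilingual_input: bool = False,
) -> Tuple[List[str], List[Notification]]:
    """Return the language used for each segment and any change notifications."""
    current = client_language  # None means "lock onto the first language heard"
    used: List[str] = []
    notifications: List[Notification] = []
    for index, lang in enumerate(detected):
        if current is None:
            # No language specified by the client: take the first one heard.
            current = lang
        elif multilingual_input and lang != current:
            # Only in multilingual mode may the language change mid-stream.
            notifications.append((index, current, lang))
            current = lang
        used.append(current)
    return used, notifications


# Single-language mode stays on the first language heard.
print(track_languages(["en", "en", "fr"]))
# Multilingual mode follows the change and records a notification.
print(track_languages(["en", "en", "fr"], multilingual_input=True))
```

The sketch ignores the boundary problem mentioned above: it treats per-segment detection as already correct, whereas in practice a short pause can leave the first sentence of the new language attributed to the previous one.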

The lang_filter option allows the client to restrict the set of candidate languages to listen for. This may be useful regardless of the multilingual_input setting, e.g. at the beginning of the input, where the language is most likely to be misdetected. If not set (the backwards-compatible default), all known languages are listened for.
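As an illustration of the filtering idea, the sketch below picks the most probable language from a dictionary of per-language detection probabilities, optionally restricted to an allowed set. The function name `pick_language`, the shape of `language_probs`, and the error raised when nothing matches are all assumptions made for this example, not the PR's implementation.

```python
from typing import Dict, Iterable, Optional


def pick_language(
    language_probs: Dict[str, float],
    lang_filter: Optional[Iterable[str]] = None,
) -> str:
    """Return the most probable language, restricted to lang_filter if given."""
    if lang_filter is None:
        # Unset filter: every known language is a candidate.
        candidates = dict(language_probs)
    else:
        allowed = set(lang_filter)
        candidates = {
            lang: prob for lang, prob in language_probs.items() if lang in allowed
        }
    if not candidates:
        raise ValueError("no candidate languages left after filtering")
    # Choose the candidate with the highest detection probability.
    return max(candidates, key=lambda lang: candidates[lang])


probs = {"en": 0.5, "nl": 0.3, "de": 0.2}
print(pick_language(probs))                # en
print(pick_language(probs, ["nl", "de"]))  # nl
```

With a filter of `["nl", "de"]`, an early misdetection as English is ruled out even though English has the highest raw probability.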

If there's interest in adding these, I can propagate them to the TensorRT code as well. I'm not sure how to add tests since that would require using a large multilingual model (we would also need to add some multilingual samples, which might be useful anyway).

AdolfVonKleist commented 2 months ago

How reliable is this in general, and how does it compare to, say, having no special filtering/processing in place? Do you have any objective benchmarks? I'm interested in using a similar approach locally.

jsichi commented 2 months ago

It's been a while since I worked on this, but it was a noticeable improvement. I don't have any benchmark for you.

collabora / WhisperLive

Add options for handling multilingual input #200