SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Is there any way we can get an example how to translate? #539

Open BBC-Esq opened 8 months ago

BBC-Esq commented 8 months ago

I only see examples pertaining to transcription, but can't faster-whisper also be used for translation, including parameters such as beam size, the source language to translate from, and all the other parameters in the WhisperModel class?

DKWoods commented 8 months ago

If I understand things correctly, all you need to do is add the parameter "task='translate'" to your model.transcribe() call. All other parameters for WhisperModel() and model.transcribe() should continue to function as usual. At least, when I added this parameter, the program gave me an English translation of the Japanese data I submitted.
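
For illustration, a minimal sketch of that call (the model size and file name here are placeholders, not from the thread):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# task="translate" asks Whisper to produce English output regardless of the source language
segments, info = model.transcribe("japanese_audio.mp3", task="translate")

for segment in segments:
    print(segment.text)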

blackpolarz commented 8 months ago

I actually used faster-whisper for translation. Currently, at least from what I know, you can only translate from other languages to English in faster-whisper. To do so, simply:

1) Import the model: from faster_whisper import WhisperModel
2) Initialize it: transcriber = WhisperModel("large-v2", device="cuda", compute_type="float16")
3) Call the transcriber: segments, info = transcriber.transcribe(audio, task="translate")
4) Convert the segments to something readable: segment_list = list(segments); data = segment_list[0]; text = data[4]

Feel free to play around with this example; a complete version of these steps is sketched below. For other parameters, you can simply add them into the WhisperModel as required.
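
Collected into one runnable sketch (the audio path is a placeholder; note that text is field index 4 of the segment tuple, so data[4] and segment.text are equivalent):

from faster_whisper import WhisperModel

# Steps 1 and 2: load the model once
transcriber = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Step 3: request translation to English
segments, info = transcriber.transcribe("audio.mp3", task="translate")

# Step 4: the result is a generator; materialize it and read the text
segment_list = list(segments)
for segment in segment_list:
    print(segment.text)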

BBC-Esq commented 8 months ago

Thanks for the responses guys. Here's my current script:

import os
from faster_whisper import WhisperModel
import time
from termcolor import cprint
import torch
import gc

class TranscribeFile:
    def __init__(self, model_name="ctranslate2-4you/whisper-small.en-ct2-float16", device="cuda", compute_type="float16"):
        self.audio_file = 'test.mp3'
        self.include_timestamps = True
        self.model = WhisperModel(model_name, device=device, compute_type=compute_type)
        self.enable_print = True
        self.my_cprint("Whisper model loaded", "green")

    def my_cprint(self, *args, **kwargs):
        if self.enable_print:
            filename = "transcribe_module.py"
            modified_message = f"{filename}: {args[0]}"
            cprint(modified_message, *args[1:], **kwargs)

    cprint("This should be red", "red")

    @staticmethod
    def format_time(seconds):
        # Converts seconds to 'hours:minutes:seconds' format if more than 59m59s
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        seconds = int(seconds % 60)
        if hours > 0:
            return f"{hours}:{minutes:02d}:{seconds:02d}"
        else:
            return f"{minutes}:{seconds:02d}"

    def transcribe(self, audio_file, output_file):
        segments, _ = self.model.transcribe(audio_file)
        transcription = []

        for segment in segments:
            if self.include_timestamps:
                start_time = self.format_time(segment.start)
                end_time = self.format_time(segment.end)
                transcription.append(f"{start_time} - {end_time} {segment.text}")
            else:
                transcription.append(segment.text)

        transcription_text = "\n".join(transcription)

        with open(output_file, 'w', encoding='utf-8') as file:
            file.write(transcription_text)

        return transcription_text

    def transcribe_to_file(self):
        if not os.path.isfile(self.audio_file):
            raise FileNotFoundError(f"Error: {self.audio_file} does not exist.")

        output_file = os.path.splitext(self.audio_file)[0] + '.txt'
        return self.transcribe(self.audio_file, output_file)

# Usage
if __name__ == "__main__":
    transcriber = TranscribeFile()
    try:
        start_time = time.time()
        transcription = transcriber.transcribe_to_file()
        end_time = time.time()
        print("Transcription completed.")
        print(f"Transcription took {end_time - start_time:.2f} seconds.")
    except FileNotFoundError as e:
        print(e)

So you're saying that I only need to add task="translate" to the line that currently reads segments, _ = self.model.transcribe(audio_file)? Why wouldn't the parameter be added at the top, where the device and compute_type are chosen? I'm not a programmer by trade, but it seems like any and all parameters would go there.

There are also a bunch of other parameters that I can't figure out where to put, like beam size. Here they are:

https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py#L167

blackpolarz commented 8 months ago

Theoretically, yes. Did you face any problems when running the code? The task parameter is not added to WhisperModel(), where the device and compute type are set, because transcribe() is a function within the WhisperModel class. While it would be possible to dump all the parameters into the WhisperModel, that is bad practice in my opinion, as it prevents anyone from reusing the same WhisperModel for other tasks. For example, you can initialize the WhisperModel, then use the same WhisperModel to transcribe (print the transcription) and then translate (print the translation), assuming you have enough VRAM.
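
A short sketch of that reuse (the file name is a placeholder): the same loaded model serves both tasks without reloading.

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# First pass: transcription in the original language (the default task)
segments, _ = model.transcribe("audio.mp3", task="transcribe")
print("".join(segment.text for segment in segments))

# Second pass: English translation with the same model instance
segments, _ = model.transcribe("audio.mp3", task="translate")
print("".join(segment.text for segment in segments))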

To understand where each parameter goes, you need to understand the concept of classes, or object-oriented programming. But to keep it simple: you can tell where each parameter goes by looking at a function's signature and checking whether it is listed there.

I am no expert when it comes to NLP, but I can briefly explain the parameters within the transcribe function if you would like to tune the transcription:

beam_size: A parameter of the beam search algorithm. It controls the spread of candidate words (the span of the search tree); you might want to google beam search to learn more. The larger the number, the wider the spread and the more computation required. Most people keep it at a maximum of 10; anything more is likely to give diminishing returns.

best_of: N-best candidates. It controls the maximum number of candidate outputs the transcriber samples from.

patience: This is a tricky one. It generalizes the stopping criterion and provides flexibility in the depth of the search; you might want to google "beam decoding with controlled patience" to learn more.

length_penalty: A penalizing factor applied to the output length. Best left untouched.

repetition_penalty: Used to penalize repetition in the generated segments.

temperature: A hyperparameter that controls the randomness and creativity of the generated text in a generative language model. In this context, it is only used as a fallback when beam search does not produce a satisfactory result.

The rest are pretty self explanatory.
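
For concreteness, all of these are keyword arguments to transcribe(). A sketch, assuming model is an already-initialized WhisperModel; the values shown are illustrative (they mirror the library defaults at the time of writing), not recommendations:

segments, info = model.transcribe(
    "audio.mp3",                 # placeholder path
    task="translate",
    beam_size=5,                 # width of the beam search
    best_of=5,                   # candidates sampled when temperature > 0
    patience=1.0,                # flexibility of the beam search stopping criterion
    length_penalty=1.0,          # usually best left at the default
    repetition_penalty=1.0,      # values > 1.0 discourage repeated tokens
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],  # fallback schedule if beam search fails
)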

BBC-Esq commented 8 months ago

Thanks for the thorough response; everything you said makes sense. I'm no expert, but over the last several months I've taught myself enough Python to follow it. I haven't tried adding the "task" parameter yet, but I will. Ultimately, I want to incorporate a transcriber (with a translation option) into multiple programs. Here is just one example:

https://github.com/BBC-Esq/ctranslate2-faster-whisper-transcriber

I'll follow up or close this issue after trying out what you suggested!