biodatlab / thonburian-whisper

Thonburian Whisper: open fine-tuned Whisper models for Thai. Try our demo on Hugging Face Spaces:
https://huggingface.co/spaces/biodatlab/whisper-thai-demo

Adding kwargs to return timestamps during transcription #2

Closed titipata closed 1 year ago

titipata commented 1 year ago

Adding return_timestamps=True to the pipe call, e.g.

text = pipe(
    "audio.mp3",
    return_timestamps=True,
    generate_kwargs={
        "language": "<|th|>",
        "task": "transcribe",
        "repetition_penalty": 1.2,
        "max_length": 448,
    },
    batch_size=16
)["text"]

Currently, return_timestamps does not work. Let's explore the minor hacks needed to get the pipeline to return timestamps. See this issue.

Error:

/usr/local/lib/python3.8/dist-packages/transformers/generation/logits_process.py in __init__(self, generate_config)
    934     def __init__(self, generate_config):  # support for the kwargs
    935         self.eos_token_id = generate_config.eos_token_id
--> 936         self.no_timestamps_token_id = generate_config.no_timestamps_token_id
    937         self.timestamp_begin = generate_config.no_timestamps_token_id + 1
    938 

AttributeError: 'GenerationConfig' object has no attribute 'no_timestamps_token_id'
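
One such hack, sketched below (not verified against this repo: it assumes the checkpoint was fine-tuned from openai/whisper-medium, whose generation config carries the missing timestamp token ids, and the model id is illustrative), is to borrow the generation config from the base checkpoint:

from transformers import GenerationConfig, pipeline

# Illustrative checkpoint id; substitute whichever Thonburian Whisper model you load.
pipe = pipeline(
    "automatic-speech-recognition",
    model="biodatlab/whisper-th-medium-combined",
)

# The fine-tuned checkpoint's GenerationConfig lacks Whisper-specific fields
# such as no_timestamps_token_id, which the timestamp logits processor reads
# in its __init__ (hence the AttributeError above). Borrowing the generation
# config from the base OpenAI checkpoint fills those fields in.
# ASSUMPTION: the fine-tune started from openai/whisper-medium.
pipe.model.generation_config = GenerationConfig.from_pretrained("openai/whisper-medium")

text = pipe(
    "audio.mp3",
    return_timestamps=True,
    generate_kwargs={"language": "<|th|>", "task": "transcribe"},
)["text"]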
z-zawhtet-a commented 1 year ago

Fixed by #1.

titipata commented 1 year ago

For the new code, don't use return_timestamps=True.

loretoparisi commented 5 months ago

@titipata sorry, how should timestamps be returned then?

I'm doing

predicted_ids = model.generate(
    input_features,
    max_length=model.config.max_target_positions,
    num_beams=5,
    length_penalty=1.0,
    do_sample=do_sample,
    temperature=temperature,
    return_timestamps=return_timestamps,
    early_stopping=early_stopping,
)
# Decode the predicted IDs to text and append to results
transcription_chunk = processor.batch_decode(
    predicted_ids,
    task=task,
    skip_special_tokens=skip_special_tokens,
    return_timestamps=return_timestamps,
)[0]
transcriptions.append(transcription_chunk)

but it does not seem to work. Transformers version was 4.41.2.

titipata commented 5 months ago

@loretoparisi, I suggest combining VAD with our model instead, as sketched below. Using the model with return_timestamps does not work well and may return wrong timestamps.
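
A minimal sketch of that approach, assuming Silero VAD (loaded via torch.hub) for segmentation, 16 kHz mono audio, and an illustrative checkpoint id; the timestamps then come from the VAD rather than from Whisper:

import torch
from transformers import pipeline

SAMPLING_RATE = 16000

# Silero VAD and its helper utilities (public torch.hub model).
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
get_speech_timestamps, _, read_audio, _, _ = utils

pipe = pipeline(
    "automatic-speech-recognition",
    model="biodatlab/whisper-th-medium-combined",  # illustrative checkpoint id
)

# Detect speech segments; 'start'/'end' are in samples.
wav = read_audio("audio.mp3", sampling_rate=SAMPLING_RATE)
segments = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)

results = []
for seg in segments:
    chunk = wav[seg["start"]:seg["end"]].numpy()
    text = pipe(
        {"raw": chunk, "sampling_rate": SAMPLING_RATE},
        generate_kwargs={"language": "<|th|>", "task": "transcribe"},
    )["text"]
    results.append({
        "start": seg["start"] / SAMPLING_RATE,  # seconds
        "end": seg["end"] / SAMPLING_RATE,
        "text": text,
    })

Each entry in results pairs a VAD-derived start/end time with the transcription of that segment, which sidesteps Whisper's own timestamp tokens entirely.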