biodatlab / thonburian-whisper

Thonburian Whisper: Open models for fine-tuned Whisper in Thai. Try our demo on Huggingface space:
https://huggingface.co/spaces/biodatlab/whisper-thai-demo

Cannot get correct translation from the model #7


jingcodeguy commented 3 weeks ago

Hello!

Thanks for providing these fine-tuned models; they give hope for Thai-language inference with better accuracy. I have tried the following methods, but none of them gave any meaningful words compared to the existing model. I tried whisper-th-large-v3-combined, whisper-th-large-v3, and whisper-th-medium-combined respectively in the following tools.

e.g. https://huggingface.co/biodatlab/whisper-th-large-v3-combined


The first thing I did was clone your model repository locally for a test.

git clone https://huggingface.co/biodatlab/whisper-th-large-v3-combined
  1. Using the sample code on the above page. Because the sample code does not output anything to the screen, I stream the results to a Tkinter window so that I don't need to wait for the whole process to finish to see the result.
import numpy as np
from transformers import pipeline
from pydub import AudioSegment
import tkinter as tk
import threading
import torch

# Set up the pipeline
MODEL_PATH = "/local/Downloads/whisper-th-large-v3"
lang = "th"
device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    device=device,
)

pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(
    language=lang,
    task="transcribe"
)

def audio_segment_to_numpy(audio_segment):
    """Convert a pydub AudioSegment to a float32 numpy array in [-1, 1]."""
    samples = np.array(audio_segment.get_array_of_samples())
    if audio_segment.channels == 2:
        # Stereo to mono conversion
        samples = samples.reshape(-1, 2).mean(axis=1)
    # Normalization assumes 16-bit samples (pydub's default sample width of 2)
    return samples.astype(np.float32) / np.iinfo(np.int16).max

def process_audio_chunk(chunk):
    """Process an audio chunk and return the transcription."""
    numpy_array = audio_segment_to_numpy(chunk)
    # Note: a bare numpy array carries no sampling rate, so the pipeline
    # treats it as already being at the feature extractor's 16 kHz
    return pipe(numpy_array)["text"]

def stream_transcription(audio_file_path, chunk_length_ms=10000):
    audio = AudioSegment.from_file(audio_file_path)
    duration_ms = len(audio)

    for start_time in range(0, duration_ms, chunk_length_ms):
        end_time = min(start_time + chunk_length_ms, duration_ms)
        chunk = audio[start_time:end_time]
        text = process_audio_chunk(chunk)

        # Use Tkinter's `after` method to update the GUI
        def update_text_widget(text=text):
            text_widget.insert(tk.END, text + "\n")
            text_widget.yview(tk.END)

        root.after(0, update_text_widget)

def run_transcription():
    """Run the transcription process in a separate thread."""
    stream_transcription(audio_file_path)

# Tkinter setup
root = tk.Tk()
root.title("Transcription Output")
text_widget = tk.Text(root, wrap=tk.WORD)
text_widget.pack(expand=True, fill=tk.BOTH)

# Path to your audio file
audio_file_path = "test.wav"

# Start transcription in a separate thread
thread = threading.Thread(target=run_transcription)
thread.start()

# Start Tkinter main loop
root.mainloop()
  2. Converting to ggml with whisper.cpp, using its conversion script convert-h5-to-ggml.py
  3. Converting to Core ML with whisper.cpp, using its conversion script generate-coreml-model.sh

The sample audio is from this video: https://www.tiktok.com/@minnimum111/video/7245259683211398406

Is there any procedure I have missed in order to use your model correctly?

jingcodeguy commented 3 weeks ago

Today I tried again with the following simple code, to make sure everything follows the sample without other unknown factors.

import torch
from transformers import pipeline

MODEL_PATH = "/Users/local/Downloads/whisper-th-large-v3" # see alternative model names below
lang = "th"

device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe.
text = pipe("test.wav", generate_kwargs={"language":"th", "task":"transcribe"}, batch_size=16)["text"]

# Specify the path to the output text file
output_text_file_path = "whisper-th-large-v3_output.txt"

# Write the transcribed text to the file
with open(output_text_file_path, "w", encoding="utf-8") as file:  # UTF-8 so Thai text is written correctly
    file.write(text)

print(f"Transcription saved to {output_text_file_path}")

And these are the transcribed results for your reference: whisper-th-large-v3_output.txt, whisper-th-large-v3-combined_output.txt

titipata commented 3 weeks ago

@jingcodeguy thanks for the issue. I suspect it could be an issue related to VAD before sending audio to the model. Here, the model may see small chunks of audio, which may cause hallucination. @z-zawhtet-a anything to add here?
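Not something from this repo, just one way to test that hypothesis: the sketch below uses Silero VAD (loaded via torch.hub; read_audio also needs torchaudio) to keep only detected speech before handing the audio to the pipeline, so short silent or noisy chunks never reach the model. The model id, file name, and 16 kHz rate are assumptions taken from the earlier comments.

import torch
from transformers import pipeline

MODEL_PATH = "biodatlab/whisper-th-large-v3-combined"  # assumed model id from above
SAMPLING_RATE = 16000

# Load Silero VAD and its helper functions from torch.hub
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device="mps" if torch.backends.mps.is_available() else "cpu",
)

# Read the file at 16 kHz and keep only the segments the VAD marks as speech
wav = read_audio("test.wav", sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)
speech_only = collect_chunks(speech_timestamps, wav)

text = pipe(
    {"raw": speech_only.numpy(), "sampling_rate": SAMPLING_RATE},
    generate_kwargs={"language": "th", "task": "transcribe"},
)["text"]
print(text)

If the VAD-trimmed audio transcribes cleanly while the raw file does not, that would support the small-chunk/hallucination explanation.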

jingcodeguy commented 3 weeks ago

@titipata Thanks for your feedback. I have also tried the original version of Whisper and whisper.cpp; both generate mostly sensible words. Because I am not a Thai expert, I cannot estimate the overall accuracy of those tools either. At the moment I can only check by running text-to-speech on the transcribed words and then listening to the original with VLC to see whether they sound too different.

titipata commented 3 weeks ago

Maybe it is from the audio sampling rate? Just guessing here.
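One way to test that guess against the first script above: as far as I can tell, the transformers pipeline treats a bare numpy array as already being at the feature extractor's 16 kHz, so a 44.1 kHz file fed that way is effectively far too slow for the model. The sketch below (the model id is an assumption; resampling inside the pipeline requires torchaudio) passes the true source rate explicitly so the pipeline can resample it.

import numpy as np
import torch
from pydub import AudioSegment
from transformers import pipeline

MODEL_PATH = "biodatlab/whisper-th-large-v3-combined"  # assumed model id from above
device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(task="automatic-speech-recognition", model=MODEL_PATH, device=device)

audio = AudioSegment.from_file("test.wav")
samples = np.array(audio.get_array_of_samples())
if audio.channels == 2:
    samples = samples.reshape(-1, 2).mean(axis=1)  # stereo -> mono
samples = samples.astype(np.float32) / np.iinfo(np.int16).max  # assumes 16-bit samples

# Pass the real source rate so the pipeline can resample to 16 kHz itself,
# instead of treating the array as if it were already 16 kHz.
text = pipe(
    {"raw": samples, "sampling_rate": audio.frame_rate},
    generate_kwargs={"language": "th", "task": "transcribe"},
)["text"]
print(text)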

jingcodeguy commented 2 weeks ago

I have the following findings to share for your reference to help improve the model in the future.

  1. To ensure the model is working properly, I first made a simple wav file of "สวัสดีครับ" ("hello"). The original sound is from Microsoft TTS and sounds very natural. Since the service provides mp3, I tried two methods of converting to wav: one with FFmpeg, the other with Audacity. The file is stereo with a 44.1 kHz sample rate. It transcribes correctly.

  2. Then I cut a portion of the test.wav used before. This portion has no children's voices, only the narrator. The video is of low quality, so the audio file is mono with a 16 kHz sample rate. It transcribes correctly (according to Google Translate of the words).

  3. Then I gradually made hybrid audio files; I made two (a pydub sketch for assembling such files follows this list). The first adds the "hello" clip at the beginning, followed by the beginning of test.wav. After transcribing "hello" correctly, the model begins to hallucinate with nonsense words.

  4. For the second file, I combined the step 1 "hello" clip and the step 2 narrator clip, then a small clip with children's and adults' voices. After transcribing "hello" and the narrator's title correctly, the model begins to hallucinate with nonsense words.
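For reproducibility, this is roughly how such hybrid files can be assembled with pydub; the input file names are placeholders, and forcing every clip to mono 16 kHz before concatenation is my assumption rather than the exact procedure described above.

from pydub import AudioSegment

# Placeholder file names for the clips described above
hello = AudioSegment.from_file("hello_tts.wav")            # step 1: Microsoft TTS "สวัสดีครับ" clip
test_start = AudioSegment.from_file("test_beginning.wav")  # beginning of test.wav (finding 3)
narrator = AudioSegment.from_file("narrator_clip.wav")     # step 2: narrator-only cut of test.wav
children = AudioSegment.from_file("children_clip.wav")     # clip with children's and adults' voices

def normalize(seg):
    """Force mono, 16 kHz, so every clip matches before concatenation."""
    return seg.set_channels(1).set_frame_rate(16000)

# Hybrid file 1: TTS greeting followed by the beginning of test.wav
hybrid1 = normalize(hello) + normalize(test_start)
hybrid1.export("hybrid1.wav", format="wav")

# Hybrid file 2: greeting + narrator title + children's/adults' clip
hybrid2 = normalize(hello) + normalize(narrator) + normalize(children)
hybrid2.export("hybrid2.wav", format="wav")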

The Hugging Face model card's suggested way of using the model was used (the code in the previous comment).

According to observations 3 and 4, when this model cannot distinguish the children's voices, it begins to drift away and hallucinate.

a. The whisper.cpp ggml-large-v3.bin model can recognize the children's voices without hallucinating or getting distracted.
b. The original OpenAI Whisper large model cannot recognize the children's voices well, but it does not hallucinate.
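For reference, the original OpenAI Whisper baseline in (b) can be run on the same clip roughly as follows with the openai-whisper package; the "large-v3" checkpoint and the file name are assumptions, not necessarily the exact setup used above.

import whisper

# Load the stock OpenAI Whisper checkpoint (downloads weights on first use)
model = whisper.load_model("large-v3")

# Transcribe the same clip, forcing Thai transcription
result = model.transcribe("test.wav", language="th", task="transcribe")
print(result["text"])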

Attached are the sample sounds and results I made for your research.

samples.zip

titipata commented 2 weeks ago

That's a cool finding! Let me digest the information and probably think about the model a bit more later.