Open jingcodeguy opened 3 weeks ago
Today, I tried again with the following simple code to make sure everything follows the sample, without other unknown factors.
```python
import torch
from transformers import pipeline

MODEL_PATH = "/Users/local/Downloads/whisper-th-large-v3"  # see alternative model names below
lang = "th"

device = "mps" if torch.backends.mps.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# Perform ASR with the created pipe.
text = pipe(
    "test.wav",
    generate_kwargs={"language": lang, "task": "transcribe"},
    batch_size=16,
)["text"]

# Write the transcribed text to the output file.
# UTF-8 is set explicitly so the Thai text is written safely on any platform.
output_text_file_path = "whisper-th-large-v3_output.txt"
with open(output_text_file_path, "w", encoding="utf-8") as file:
    file.write(text)

print(f"Transcription saved to {output_text_file_path}")
```
And here are the transcribed results for your reference: whisper-th-large-v3_output.txt, whisper-th-large-v3-combined_output.txt
@jingcodeguy thanks for the issue. I suspect it could be related to VAD/chunking before the audio is sent to the model. The model may see small chunks of audio, which can cause hallucination. @z-zawhtet-a anything to add here?
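To make the small-chunk hypothesis concrete, here is a rough sketch of how fixed-length chunking with overlapping strides can leave a short trailing chunk. This is an illustrative toy, not the actual `transformers` implementation; the function name and the default stride are assumptions for the example.

```python
# Illustrative sketch (NOT the transformers internals): overlapping
# fixed-length chunking in the spirit of chunk_length_s with strides.
def chunk_windows(n_samples, sr=16_000, chunk_s=30, stride_s=5):
    """Return (start, end) sample windows; successive chunks overlap."""
    chunk = chunk_s * sr
    step = chunk - 2 * stride_s * sr  # hop between successive windows
    windows = []
    start = 0
    while True:
        windows.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return windows

# A 75-second clip at 16 kHz: the final window is only 15 seconds long,
# so the model sees a much shorter chunk at the end of the audio.
wins = chunk_windows(75 * 16_000)
print(wins)
```

If short chunks are the trigger, padding the end of the file with silence or varying `chunk_length_s` might change where the hallucination starts, which would be a cheap way to test this.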
@titipata Thanks for your feedback. I have also tried the original OpenAI Whisper and whisper.cpp; both generate sensible words for most of the audio. Since I am not a Thai expert, I cannot estimate the overall accuracy of those tools either. For now, I can only check by running text-to-speech on the transcribed words and then listening to the original in VLC to see whether it sounds too different.
Maybe it is the audio sampling rate? Just guessing here.
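To rule the sampling rate in or out, the stdlib `wave` module can report a file's format before it reaches the pipeline. (As far as I know, the `transformers` ASR pipeline resamples file inputs to the feature extractor's rate, 16 kHz for Whisper, so a mismatched source rate should normally be handled, but checking is a quick sanity test.) A minimal sketch that builds a tiny synthetic WAV just to have something to inspect; the file name and tone are placeholders:

```python
import math
import struct
import wave

def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels(),
            wf.getframerate(),
            wf.getnframes() / wf.getframerate(),
        )

# Build a 1-second 440 Hz mono tone at 16 kHz, 16-bit PCM.
sr = 16_000
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(sr)
    samples = (int(20_000 * math.sin(2 * math.pi * 440 * t / sr)) for t in range(sr))
    wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(describe_wav("tone.wav"))  # → (1, 16000, 1.0)
```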
I have the following findings to share for your reference to help improve the model in the future.
To ensure the model is working properly, I first made a simple wav file saying "สวัสดีครับ" ("hello"). The original sound is from Microsoft TTS and sounds very natural. Since the service provides mp3, I tried two methods of converting to wav: one with FFmpeg, the other with Audacity. The file is stereo with a 44.1 kHz sample rate. It transcribes correctly.
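For reference, a stereo WAV can also be downmixed to mono in pure Python with the stdlib `wave` module. This is a naive sketch for 16-bit PCM only and does not resample; FFmpeg's `-ac 1 -ar 16000` does the same job more robustly. File names here are placeholders:

```python
import array
import wave

def stereo_to_mono(src_path, dst_path):
    """Naively average the two channels of a 16-bit stereo WAV into mono."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        rate = src.getframerate()
        pcm = array.array("h")
        pcm.frombytes(src.readframes(src.getnframes()))
    # Interleaved L/R samples: average each pair into one mono sample.
    mono = array.array("h", ((pcm[i] + pcm[i + 1]) // 2 for i in range(0, len(pcm), 2)))
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())

# Demo: write a short synthetic stereo file, then downmix it.
with wave.open("stereo.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44_100)
    frames = array.array("h", [1000, 3000] * 441)  # L=1000, R=3000 for 441 frames
    wf.writeframes(frames.tobytes())

stereo_to_mono("stereo.wav", "mono.wav")
```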
Then I cut a portion of the test.wav used before. This portion has no children's voices, only the narrator. The video is of low quality, so the audio file is mono with a 16 kHz sample rate. It transcribes correctly (according to Google Translate of the words).
Then I slowly built hybrid audio files; I made two. The first adds the "hello" clip at the beginning, followed by the beginning of test.wav. After transcribing the word "hello" correctly, the model begins to hallucinate with nonsense words.
The second file combines the step-1 "hello" clip and the step-2 narrator title, then a small clip with children's and adult voices. After transcribing "hello" and the narrator's title correctly, it again begins to hallucinate with nonsense words.
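The hybrid files above were cut together by hand; for reproducibility, same-format WAV clips can also be concatenated in code. A minimal stdlib sketch, assuming all inputs share the same channel count, sample width, and rate (the file names are placeholders):

```python
import wave

def concat_wavs(out_path, in_paths):
    """Concatenate WAV files that share the same channels, width, and rate."""
    with wave.open(out_path, "wb") as out:
        params = None
        for path in in_paths:
            with wave.open(path, "rb") as src:
                if params is None:
                    params = src.getparams()
                    out.setnchannels(params.nchannels)
                    out.setsampwidth(params.sampwidth)
                    out.setframerate(params.framerate)
                else:
                    assert src.getparams()[:3] == params[:3], "format mismatch"
                out.writeframes(src.readframes(src.getnframes()))

# Demo with two tiny synthetic mono clips of 160 frames each.
for name, value in [("a.wav", 100), ("b.wav", 200)]:
    with wave.open(name, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16_000)
        wf.writeframes(value.to_bytes(2, "little", signed=True) * 160)

concat_wavs("hybrid.wav", ["a.wav", "b.wav"])
```

Scripting the splice points would also make it easier to bisect exactly where the hallucination begins.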
I used the model in the way suggested on Hugging Face (the code in the previous comment).
According to observations 3 and 4, when this model cannot distinguish the children's voices, it begins to drift away and hallucinate.
a. The whisper.cpp version's ggml-large-v3.bin model can recognize the children's voices without hallucinating or getting distracted. b. The original OpenAI Whisper large model does not recognize the children's voices well, but it does not hallucinate.
Attached are the sample sounds and results I made for your research.
That's a cool finding! Let me ingest the information and probably think about the model a bit more later.
Hello!
Thanks for providing hope for Thai-language inference with better accuracy. I have tried the following methods, but none could give meaningful words compared to the existing model. I have tried
whisper-th-large-v3-combined
whisper-th-large-v3
whisper-th-medium-combined
respectively in the following tools, e.g. https://huggingface.co/biodatlab/whisper-th-large-v3-combined
System
The first thing I did was clone your project locally for a test.
Sample audio is from this video: https://www.tiktok.com/@minnimum111/video/7245259683211398406
Is there any procedure I have missed to use your model?