TalkBank / batchalign2

Tools for language sample analysis.
https://talkbank.org/info/BA2-usage.pdf
BSD 3-Clause "New" or "Revised" License

Takes too long to transcribe a 5-minute Mandarin Chinese audio file on macOS Sonoma 14.5 #4

Closed · Melody-SHANG closed this issue 3 months ago

Melody-SHANG commented 3 months ago

Thank you so much for sharing this wonderful package. I've run batchalign transcribe on a Mac mini with an Apple M2; batchalign installed successfully, and batchalign --help also works. The problem is that the transcription freezes at 0% for a dozen minutes. I'd like to know how long it usually takes to transcribe a 5-minute .mp3 audio file. My terminal output is below:

melody@MengyaodeMac-mini ~ % batchalign transcribe --lang=zho --whisper /Users/melody/Documents/AutoTools/input /Users/melody/Documents/AutoTools/output

Mode: transcribe; got 1 transcript to process from /Users/melody/Documents/AutoTools/input:

You have passed task=transcribe, but also have set forced_decoder_ids to [[1, None], [2, 50360]] which creates a conflict. forced_decoder_ids will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
WhisperModel is using WhisperSdpaAttention, but torch.nn.functional.scaled_dot_product_attention does not support output_attentions=True or layer_head_mask not None. Falling back to the manual attention implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument attn_implementation="eager" when loading the model.
⠋ BJ6720_xingxing.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:07:34 Running: ASR

melody@MengyaodeMac-mini ~ % pip install -U numpy
Requirement already satisfied: numpy in /opt/homebrew/lib/python3.11/site-packages (2.0.1)

[notice] A new release of pip is available: 24.1.2 -> 24.2
[notice] To update, run: python3.11 -m pip install --upgrade pip

melody@MengyaodeMac-mini ~ % pip install numpy==1.26.4
Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.8/114.8 kB 6.4 MB/s eta 0:00:00
  Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 10.5 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.1
    Uninstalling numpy-2.0.1: Successfully uninstalled numpy-2.0.1
⠏ BJ6720_xingxing.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:14:46 Running: ASR

Jemoka commented 3 months ago

Could you let me know the configuration of the machine (which M2 chip: Pro? Ultra?) and how much memory it has (16GB? 32GB?)? ASR is fairly heavy and will sadly be slow, especially on smaller machines. It is also possible that the model is still being downloaded, which will take extra time.

Sadly, the ASR progress bar stays stuck at 0% until a significant portion of the transcript (a quarter to half) is done, as there is no reliable way for us to report incremental progress.

Thanks!

Melody-SHANG commented 3 months ago

Thank you for the prompt reply. I'm using an M2 Pro device with 16GB, so the reasons you mentioned are very likely. On the first try, it took around 50 minutes to transcribe the 5-minute, 5.2 MB mp3, around 3 minutes to generate morphological tags, and around 50 minutes to align. Looking forward to future updates on this project.

macw commented 3 months ago

I’m guessing you were using Whisper. It seems to take about 5X more time than Rev-AI.

—Brian MacWhinney

Melody-SHANG commented 3 months ago

Hi Prof. MacWhinney, thank you for getting back to me. I tried Rev-AI on some adult English narratives, and the output differed only minimally from a manual transcription. Amazing. The Mandarin transcripts, however, are still far from what we produce manually. Since it's possible to incorporate paid services like Rev-AI, I wonder whether batchalign could integrate other tools in the future to offer more choices. Our team has used voice recognition software from iFlytek, a paid service that is very accurate in Mandarin; a single-language system like that might outperform Whisper. Best regards, Melody

macw commented 3 months ago

Interesting idea. I will check with Houjun about this. A lot depends on whether iFlytek provides an API.

— Brian MacWhinney, Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU

Jemoka commented 3 months ago

Looks like iFlytek is a product-first (hardware?) company? If they provide an API I'm of course happy to write an adaptor for Batchalign.
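For anyone curious what such an adaptor involves, here is a rough sketch of the shape of the work. The class and method names below (ASRBackend, Word, transcribe_file) are purely illustrative placeholders, not Batchalign's actual engine interface:

# Hypothetical sketch only: a uniform interface that any ASR service
# (Whisper, Rev-AI, iFlytek, ...) could be wrapped behind. None of these
# names come from Batchalign itself.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: int

class ASRBackend:
    """Interface an adaptor would implement: audio in, timestamped words out."""
    def transcribe_file(self, path: str, lang: str) -> List[Word]:
        raise NotImplementedError

class IFlytekBackend(ASRBackend):
    """Placeholder adaptor: would call iFlytek's API and map its response to Word objects."""
    def __init__(self, app_id: str, api_key: str):
        self.app_id = app_id
        self.api_key = api_key

    def transcribe_file(self, path: str, lang: str) -> List[Word]:
        # 1. authenticate against the service
        # 2. upload or stream the audio
        # 3. collect the timestamped results
        # 4. convert them into the common Word structure
        raise NotImplementedError("sketch only")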

Melody-SHANG commented 3 months ago

Hi Houjun, please check whether they have posted any updated API resources here: https://global.xfyun.cn/doc/rtasr/rtasr/API.html. I remember they used to support developers of all kinds; I'm just not sure whether they still do. Best, Melody
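From a quick read of that page, the real-time service appears to authenticate with an appid plus a signature computed over a timestamp, and then stream audio over a WebSocket. Below is a rough sketch of how that handshake might be built; the endpoint and the exact parameter names (appid, ts, signa) reflect my reading of the linked docs and should be verified against them before use:

# Rough sketch of the iFlytek RTASR handshake as I read the linked docs;
# the endpoint and parameter names should be double-checked.
import base64, hashlib, hmac, time
from urllib.parse import quote

APP_ID = "your_app_id"     # placeholder credentials
API_KEY = "your_api_key"

def build_handshake_url() -> str:
    ts = str(int(time.time()))
    # signa = Base64(HmacSHA1(MD5(appid + ts), api_key)), per my reading of the RTASR docs
    digest = hashlib.md5((APP_ID + ts).encode()).hexdigest()
    signa = base64.b64encode(
        hmac.new(API_KEY.encode(), digest.encode(), hashlib.sha1).digest()
    ).decode()
    return "wss://rtasr.xfyun.cn/v1/ws" + f"?appid={APP_ID}&ts={ts}&signa={quote(signa)}"

# Audio would then be streamed over the resulting WebSocket in small PCM chunks,
# with transcription results coming back as JSON messages.
print(build_handshake_url())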

Jemoka commented 3 months ago

Got it. This is certainly interesting to investigate. It looks like this is a company based in China that also processes data in China, which I'm worried folks will have IRB issues with (i.e., HIPAA, etc.). Prof. MacWhinney, I wonder if it would be worth investigating iFlyTek (https://global.xfyun.cn/doc/rtasr/rtasr/API.html) as Melody suggested? Melody, I will investigate further, though I wonder how Rev / Whisper do; perhaps they already do fairly well.

Thanks! —Jack

Melody-SHANG commented 3 months ago

Hi Jack, glad you are interested. Rev / Whisper did excellent work on adult English narratives; the English .cha output is very close to a manual transcription. Whisper did not do well on adult Mandarin narratives; the .cha output is far from the transcript done manually. The final Mandarin .cha output has problems such as word/utterance segmentation errors, missing words, and wrong characters (Chinese words). Rev only supports a few languages ('en', 'en-us', 'en-gb', 'es', 'fr', 'pt'), if I did it right; see below:

(.venv) PS C:\Users\LIN\PycharmProjects\pythonProject1> batchalign transcribe --lang=zho D:\AutoTrans\transcribe D:\AutoTrans\output

Mode: transcribe; got 1 transcript to process from D:\AutoTrans\transcribe:

GZ0817_anan_MOT_ManMAIN_Cat.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:00:02 FAIL

ERROR on file GZ0817_anan_MOT_ManMAIN_Cat.mp3: 400 Client Error: Bad Request for url: https://api.rev.ai/speechtotext/v1/jobs; Server Response : {"parameters":{"options.speakers_count":["This option is only allowed for the following languages: ['en', 'en-us', 'en-gb', 'es', 'fr', 'pt']"]},"type":"https://www.rev.ai/api/v1/errors/invalid-parameters","title":"Your request parameters didn't validate","status":400,"extensions":{}}

(.venv) PS C:\Users\LIN\PycharmProjects\pythonProject1>

Best, Melody
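A side note on that failure: the 400 response above is Rev rejecting the speakers_count option for a language outside the listed set, not necessarily the language itself. A hedged sketch of submitting a job directly, without that option, is below; the source_config and language fields follow Rev's public job-submission API as I understand it, and the Mandarin language code ("cmn") is an assumption to verify against their docs:

# Hedged sketch: submit a Rev job directly, leaving out speakers_count,
# which the error above says is only allowed for a handful of languages.
# The endpoint and Bearer auth match the error/Rev docs; "cmn" is an assumption.
import requests

API_TOKEN = "REV_AI_ACCESS_TOKEN"   # placeholder
MEDIA_URL = "https://example.com/GZ0817_anan_MOT_ManMAIN_Cat.mp3"  # placeholder

resp = requests.post(
    "https://api.rev.ai/speechtotext/v1/jobs",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "source_config": {"url": MEDIA_URL},
        "language": "cmn",
        # no "speakers_count": that option triggers the 400 above for Mandarin
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])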

macw commented 3 months ago

I believe that Melody said that Rev-AI did not do well for Mandarin. But didn't we already see that it wasn't bad? If Melody could run a comparison test between iFly and Rev-AI, that might help.

— Brian MacWhinney, Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU

macw commented 3 months ago

Perhaps our earlier test for Mandarin relied on Whisper. Does iFlyTek also do better than Whisper?

— Brian MacWhinney, Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU
