Closed Melody-SHANG closed 3 months ago
Could you let me know what the configuration of the machine is (M2 chip type—Pro? Ultra?) and memory (16GB? 32GB?) etc.? ASR is fairly heavy, and sadly will be slow especially on smaller machines. Also, it is possible that the model is being downloaded, which will take extra time.
Sadly, the ASR progress bar stays at 0% until a significant portion of the transcript (a quarter to half) is done, as there is no reliable way for us to measure incremental progress.
Thanks!
Thank you for the prompt reply. I'm using an M2 Pro device with 16GB of memory, so the reasons you mentioned are very likely the cause. On the first try, a 5-minute mp3 (5.2MB) took around 50 min to transcribe, around 3 min to generate morphtag, and around 50 min to align. Looking forward to future updates on this project.
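As a rough way to read these timings, each stage can be expressed as a real-time factor (processing time divided by audio duration). The sketch below uses only the approximate numbers reported in this comment; the function and variable names are illustrative and not part of batchalign.

```python
def real_time_factor(processing_min: float, audio_min: float) -> float:
    """Processing time divided by audio duration (higher means slower)."""
    if audio_min <= 0:
        raise ValueError("audio duration must be positive")
    return processing_min / audio_min


# Approximate figures reported for one 5-minute mp3 on an M2 Pro with 16GB.
AUDIO_MIN = 5.0
STAGE_MINUTES = {"transcribe": 50.0, "morphtag": 3.0, "align": 50.0}

factors = {stage: real_time_factor(m, AUDIO_MIN) for stage, m in STAGE_MINUTES.items()}
total_minutes = sum(STAGE_MINUTES.values())

for stage, rtf in factors.items():
    print(f"{stage}: {rtf:.1f}x real time")
print(f"total: {total_minutes:.0f} min for {AUDIO_MIN:.0f} min of audio")
```

On these numbers, transcription and alignment each run at roughly ten times the audio length, which is why the whole pipeline takes well over an hour for a five-minute file.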
I’m guessing you were using Whisper. It seems to take about 5X more time than Rev-AI.
—Brian MacWhinney
Hi Prof. MacWhinney, Thank you for getting back to me. I tried Rev-AI on some adult English narratives, and the output differs minimally from manual transcription. Amazing. The Mandarin transcripts, however, are still far from the quality of manual transcription. Since it’s possible to incorporate paid services like Rev-AI, I wonder if batchalign could integrate other tools and offer more choices in the future. Our team has used voice recognition software from iFlytek, a paid service that is very accurate in Mandarin; a single-language tool like this might outperform Whisper. Best regards, Melody
Interesting idea. I will check with Houjun about this. A lot depends on whether iFlytek provides an API.
— Brian MacWhinney Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU
Looks like iFlytek is a product-first (hardware?) company? If they provide an API I'm of course happy to write an adaptor for Batchalign.
Hi Houjun, pls check if they posted any updated sources for API here: https://global.xfyun.cn/doc/rtasr/rtasr/API.html. I remember they used to support developers of all kinds. Just not sure if they're still doing it. Best, Melody
Got it. This is certainly interesting to investigate. It looks like this company is based in, and processes data in, China, which I'm worried will raise IRB issues for some folks (e.g. HIPAA). Prof. MacWhinney, I wonder if it would be interesting to investigate iFlytek (https://global.xfyun.cn/doc/rtasr/rtasr/API.html), as Melody said? Melody, I will investigate further, though I wonder how Rev / Whisper does. Perhaps they already do fairly well.
Thanks! —Jack
(.venv) PS C:\Users\LIN\PycharmProjects\pythonProject1> batchalign transcribe --lang=zho D:\AutoTrans\transcribe D:\AutoTrans\output
Mode: transcribe; got 1 transcript to process from D:\AutoTrans\transcribe:
GZ0817_anan_MOT_ManMAIN_Cat.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:00:02 FAIL
ERROR on file GZ0817_anan_MOT_ManMAIN_Cat.mp3: 400 Client Error: Bad Request for url: https://api.rev.ai/speechtotext/v1/jobs; Server Response : {"parameters":{"options.speakers_count":["This option is only allowed for the following languages: ['en', 'en-us', 'en-gb', 'es', 'fr', 'pt']"]},"type":"https://www.rev.ai/api/v1/errors/invalid-parameters","title":"Your request parameters didn't validate","status":400,"extensions":{}}
Best, Melody
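Note that the 400 response quoted above is about the options.speakers_count parameter specifically: the server says that option is only allowed for the listed languages. One way a client could avoid this rejection is to omit the option for any other language. The following is a minimal sketch of that idea; build_job_options is a hypothetical helper, the shape of the options dictionary is an assumption, and this is not batchalign's or Rev's actual code.

```python
# Languages for which the error message says `speakers_count` is allowed.
SPEAKER_COUNT_LANGUAGES = frozenset({"en", "en-us", "en-gb", "es", "fr", "pt"})


def build_job_options(language, speakers_count=None):
    """Hypothetical helper: build job options, dropping `speakers_count`
    when the language is not in the list from the error message."""
    options = {"language": language}
    if speakers_count is not None and language.lower() in SPEAKER_COUNT_LANGUAGES:
        options["speakers_count"] = speakers_count
    return options


# "cmn" is only an illustrative code that is not in the allowed list.
print(build_job_options("en", speakers_count=2))
print(build_job_options("cmn", speakers_count=2))
```

Whether the service would then accept a Mandarin job, and how well it would transcribe it, is a separate question that this sketch does not answer.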
I believe Melody said that Rev-AI did not do well for Mandarin. But didn’t we already see that it wasn’t bad? If Melody could run a comparison test between iFlytek and Rev-AI, that might help.
— Brian MacWhinney Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU
Perhaps our earlier test for Mandarin relied on Whisper. Does iFlyTek also do better than Whisper?
— Brian MacWhinney Teresa Heinz Professor of Cognitive Psychology, Language Technologies and Modern Languages, CMU
On Aug 12, 2024, at 3:09 AM, Melody wrote:
Hi Jack, Glad you are interested. Rev / Whisper did excellent work on adult English narratives; the English cha output is very close to manual transcription. Whisper did not do well on adult Mandarin narratives; the cha output is far from the manually produced transcript. The final Mandarin cha output has problems such as word/utterance segmentation errors, missing words, and wrong characters (Chinese words). If I ran it correctly, Rev only supports a few languages ('en', 'en-us', 'en-gb', 'es', 'fr', 'pt'); see the batchalign transcribe --lang=zho error output quoted earlier in this thread. Best, Melody
Thank you so much for sharing this wonderful package. I've run batchalign transcribe on a Mac mini with an Apple M2; batchalign installed successfully, and batchalign --help also works. The problem is that the transcription freezes at 0% for a dozen minutes. I'd like to know how long it usually takes to transcribe a 5-minute .mp3 audio file. See my terminal output below:
melody@MengyaodeMac-mini ~ % batchalign transcribe --lang=zho --whisper /Users/melody/Documents/AutoTools/input /Users/melody/Documents/AutoTools/output
Mode: transcribe; got 1 transcript to process from /Users/melody/Documents/AutoTools/input:
You have passed task=transcribe, but also have set forced_decoder_ids to [[1, None], [2, 50360]] which creates a conflict. forced_decoder_ids will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
WhisperModel is using WhisperSdpaAttention, but torch.nn.functional.scaled_dot_product_attention does not support output_attentions=True or layer_head_mask not None. Falling back to the manual attention implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument attn_implementation="eager" when loading the model.
⠋ BJ6720_xingxing.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:07:34 Running: ASR
melody@MengyaodeMac-mini ~ % pip install -U numpy
Requirement already satisfied: numpy in /opt/homebrew/lib/python3.11/site-packages (2.0.1)
[notice] A new release of pip is available: 24.1.2 -> 24.2
[notice] To update, run: python3.11 -m pip install --upgrade pip
melody@MengyaodeMac-mini ~ % pip install numpy==1.26.4
Collecting numpy==1.26.4
Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl.metadata (114 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.8/114.8 kB 6.4 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 10.5 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 2.0.1
Uninstalling numpy-2.0.1:
Successfully uninstalled numpy-2.0.1
⠏ BJ6720_xingxing.mp3 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:14:46 Running: ASR