Implement distli-whisper

ooe1123 commented 11 months ago

This is the PR for https://github.com/axinc-ai/ailia-models/issues/1306.

kyakuno commented 11 months ago

M2 MPS

distil-whisper

 INFO distil-whisper.py (297) :     total processing time 24532 ms
 INFO distil-whisper.py (299) : He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.

whisper-large-v2

[00:00.000 --> 00:06.000]  He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton
[00:06.000 --> 00:10.200]  pieces to be ladled out in thick, peppered, flour-fattened sauce.
 INFO whisper.py (927) :    total processing time 51876 ms
 INFO whisper.py (936) : Script finished successfully.
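
For reference, the ratio between the two total processing times above can be worked out as follows. This is only an arithmetic sketch on the two logged totals; `speedup` is a hypothetical helper, not part of either script.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return baseline_ms / candidate_ms


# Totals copied from the logs above (whisper-large-v2 vs. distil-whisper).
ratio = speedup(51876, 24532)
print(f"{ratio:.2f}x")  # prints "2.11x"
```

This is a single run on one audio clip, so the ratio is indicative rather than a benchmark.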

kyakuno commented 11 months ago

distil-whisper

 INFO distil-whisper.py (260) :     encoder processing time 5837 ms
 INFO distil-whisper.py (152) :     decoder processing time 304 ms
 INFO distil-whisper.py (152) :     decoder processing time 318 ms
 INFO distil-whisper.py (152) :     decoder processing time 328 ms
 INFO distil-whisper.py (152) :     decoder processing time 349 ms
 INFO distil-whisper.py (152) :     decoder processing time 298 ms
 INFO distil-whisper.py (152) :     decoder processing time 295 ms

whisper-large-v2

 INFO whisper.py (921) : BENCHMARK mode
 INFO whisper.py (383) :    encoder processing time 16081 ms
 INFO whisper.py (468) :    decoder processing time 2390 ms
 INFO whisper.py (739) : Detected language: English
 INFO whisper.py (383) :    encoder processing time 15920 ms
 INFO whisper.py (468) :    decoder processing time 2221 ms
 INFO whisper.py (468) :    decoder processing time 2006 ms
 INFO whisper.py (468) :    decoder processing time 2272 ms
 INFO whisper.py (468) :    decoder processing time 289 ms
 INFO whisper.py (468) :    decoder processing time 245 ms
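
To compare the per-stage timings, the log lines above can be grouped by stage. A minimal sketch, assuming the log format shown in this comment; `stage_times` is a hypothetical helper and the sample text is a short excerpt of the distil-whisper log.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches lines such as "encoder processing time 5837 ms".
_PATTERN = re.compile(r"(encoder|decoder) processing time (\d+) ms")


def stage_times(log_text: str) -> Dict[str, List[int]]:
    """Collect the processing times in milliseconds for each stage."""
    times: Dict[str, List[int]] = defaultdict(list)
    for stage, ms in _PATTERN.findall(log_text):
        times[stage].append(int(ms))
    return dict(times)


sample = """
 INFO distil-whisper.py (260) :     encoder processing time 5837 ms
 INFO distil-whisper.py (152) :     decoder processing time 304 ms
 INFO distil-whisper.py (152) :     decoder processing time 318 ms
"""
print(stage_times(sample))  # {'encoder': [5837], 'decoder': [304, 318]}
```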

kyakuno commented 11 months ago

@ooe1123 The current implementation does Short-Form Transcription, so I believe it can only transcribe short audio. Would it be possible to support Long-Form Transcription? https://github.com/huggingface/distil-whisper

kyakuno commented 11 months ago

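chunk_length_s appears to be the relevant parameter. The official Whisper advances through the audio by cutting at timestamps, whereas distil-whisper seems to transcribe overlapping chunks and join the text afterwards. https://huggingface.co/blog/asr-chunking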

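import torch
from transformers import pipeline

# Assumption: use CUDA when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large",
  chunk_length_s=30,  # split long audio into 30-second chunks
  device=device,
)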

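5 Chunked Long-Form Transcription
Whisper models have a fixed receptive field corresponding to 30 seconds of input audio and cannot process longer audio inputs in one pass. Most academic datasets consist of short utterances of less than 30 seconds, so this is not a problem there. Real-world applications such as meeting transcription, however, usually require transcribing long audio files that last many minutes or hours.
The original Whisper paper proposes a long-form transcription algorithm that transcribes 30-second audio segments sequentially and shifts a sliding window according to the timestamps predicted by the model. This autoregressive algorithm requires both beam search and temperature fallback to ensure accurate long-form transcription (Radford et al., 2022).
We use an alternative strategy first proposed by Patry (2022). The long audio file is split into smaller segments with a small overlap between adjacent segments. The model is run on each chunk, and the inferred text is joined at the strides by finding the longest common sequence between the overlaps. The stride allows accurate transcription across chunks without transcribing them sequentially. This algorithm requires only greedy decoding and transcribes long audio files reliably. It is also semi-autoregressive in the sense that the chunks can be transcribed in any order, the only constraint being that adjacent chunks are joined correctly at their boundaries. This makes it possible to batch the chunks and transcribe them in parallel. In practice, this gives up to a 9x improvement in inference speed over sequential transcription. In this work, we use the chunked long-form transcription algorithm when evaluating both the Whisper and Distil-Whisper models.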
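
The chunk-and-join strategy described in the quoted section can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the transformers implementation: `chunk_bounds` and `merge_on_overlap` are hypothetical helpers, the merge works on plain word lists, and it uses a simple suffix/prefix match as a stand-in for the longest-common-sequence search.

```python
from typing import List, Tuple


def chunk_bounds(n_samples: int, chunk_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering the audio, with overlapping neighbours."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        bounds.append((start, end))
        if end >= n_samples:
            break
        start += step
    return bounds


def merge_on_overlap(left: List[str], right: List[str]) -> List[str]:
    """Join two token lists, dropping the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    # No shared tokens at the boundary: fall back to plain concatenation.
    return left + right


# 70 s of audio in 30 s chunks with a 5 s overlap (units are seconds here for readability).
print(chunk_bounds(70, 30, 5))  # [(0, 30), (25, 55), (50, 70)]

a = "he hoped there would be".split()
b = "would be stew for dinner".split()
print(" ".join(merge_on_overlap(a, b)))  # he hoped there would be stew for dinner
```

Because each chunk depends only on its own audio range, the chunks can be transcribed in a batch and merged pairwise afterwards.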

ooe1123 commented 11 months ago

@kyakuno Sorry, I have not fully caught up on this. Is there something I should address here? Should I look at what the chunk_length_s parameter does and make it possible to set chunk_length_s in the sample code as well?

kyakuno commented 11 months ago

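@ooe1123 The current model accepts at most 30 seconds of input, so I expect that long audio will either raise an error or produce an incorrect recognition result. chunk_length_s exists as the countermeasure for this problem, and I am wondering whether supporting chunk_length_s would let the sample handle long audio in the same way as the regular Whisper.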

ooe1123 commented 11 months ago

@kyakuno I have added support for the chunk_length_s parameter.

kyakuno commented 11 months ago

Thank you!

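Note: it appears that when the _decode_asr API is given the overlapping token sequences produced with chunk_length_s, it looks at the timestamps and cuts, pastes, and merges them appropriately. https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/tokenization_whisper.py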
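
The idea of merging overlapping chunks by timestamp can be illustrated with a simplified sketch. This is not how `_decode_asr` is implemented; `merge_by_timestamp` is a hypothetical helper, and the rule it uses (drop any segment that starts before the time already covered by earlier chunks) is an assumption made for illustration.

```python
from typing import List, Tuple

# (start_s, end_s, text), with times relative to the start of the chunk.
Segment = Tuple[float, float, str]


def merge_by_timestamp(chunks: List[Tuple[float, List[Segment]]]) -> List[Segment]:
    """Merge per-chunk segments into absolute-time segments, skipping overlap duplicates.

    `chunks` is a list of (chunk_offset_s, segments) in chronological order.
    """
    merged: List[Segment] = []
    covered_until = 0.0
    for offset, segments in chunks:
        for start, end, text in segments:
            abs_start, abs_end = offset + start, offset + end
            # A segment starting inside the already-covered region is a duplicate.
            if abs_start < covered_until:
                continue
            merged.append((abs_start, abs_end, text))
            covered_until = abs_end
    return merged


chunks = [
    (0.0, [(0.0, 6.0, "first"), (6.0, 10.5, "second")]),
    (8.0, [(0.0, 2.5, "second (overlap)"), (2.5, 5.0, "third")]),
]
print(merge_by_timestamp(chunks))
# [(0.0, 6.0, 'first'), (6.0, 10.5, 'second'), (10.5, 13.0, 'third')]
```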

axinc-ai / ailia-models

Implement distli-whisper #1321