SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

faster-whisper vs whisper: PyAV stops during decode, ffmpeg continues #988

Open rodrigofvale opened 2 months ago

rodrigofvale commented 2 months ago

The audio file is corrupted at the end, so an error is expected during decoding. However, PyAV stops processing when it hits the error, while whisper, which uses ffmpeg, processes the file up to the point where the corrupted area is detected.

Expected behavior: PyAV processes the valid part of the file and emits a warning message.

Can we add a parameter to faster-whisper so that it behaves like whisper, i.e. processes the file up to the corrupted part?

Workaround: use the ffmpeg command line to re-export the corrupted file, discarding the invalid data, and run faster-whisper on the new file produced by ffmpeg. This wastes processing time that a parameter to ignore corrupted content would save.
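The workaround could be scripted roughly as below. This is only a sketch under stated assumptions: `build_ffmpeg_cmd`, `trim_to_whole_samples`, `decode_valid_part` and `transcribe_valid_part` are hypothetical helper names, `ffmpeg` is assumed to be on `PATH`, and `model` is assumed to be an already constructed `WhisperModel`. Instead of writing a temporary file, it pipes raw PCM from ffmpeg and hands `transcribe` a NumPy array, which faster-whisper is understood to accept as audio input.

```python
import subprocess
import sys


def build_ffmpeg_cmd(path, sample_rate=16000):
    """Build an ffmpeg command that decodes `path` to mono 16-bit PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-v", "error",
        "-i", path,
        "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le",
        "-ar", str(sample_rate),
        "-",
    ]


def trim_to_whole_samples(pcm, bytes_per_sample=2):
    """Drop a trailing partial sample so the buffer length is a multiple of the sample size."""
    return pcm[: len(pcm) - len(pcm) % bytes_per_sample]


def decode_valid_part(path, sample_rate=16000):
    """Return the PCM bytes ffmpeg managed to decode; report its errors as a warning."""
    # No check=True: output produced before the corrupted tail is still wanted.
    proc = subprocess.run(build_ffmpeg_cmd(path, sample_rate), capture_output=True)
    if proc.returncode != 0 or proc.stderr:
        message = proc.stderr.decode(errors="replace").strip()
        print("warning: ffmpeg reported: " + message, file=sys.stderr)
    return trim_to_whole_samples(proc.stdout)


def transcribe_valid_part(model, path, **kwargs):
    """Transcribe the decodable part of `path` with an existing WhisperModel (assumed)."""
    import numpy as np  # imported lazily so the helpers above need only the stdlib

    pcm = decode_valid_part(path)
    audio = np.frombuffer(pcm, np.int16).astype(np.float32) / 32768.0
    return model.transcribe(audio, **kwargs)
```

With the script from this issue, the call would look like `segments, info = transcribe_valid_part(model, fileName, beam_size=5, language="pt")`. The extra ffmpeg process is still the overhead the issue complains about; the sketch only removes the intermediate file.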

MahmoudAshraf97 commented 2 months ago

can you upload the file to test?

rodrigofvale commented 2 months ago

12487430.mp4a.gz

This is the python source code

```python
import time
from faster_whisper import WhisperModel
import logging

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

model_size = "large-v3"
model = WhisperModel(model_size, device="cuda", compute_type="float16")

def speech2text(fileName):
    segments, info = model.transcribe(fileName, beam_size=5, language="pt")
    text = ""
    for segment in segments:
        text = text + "[{0:.2f}s -> {1:.2f}s] {2}\n".format(segment.start, segment.end, segment.text)
    text = text + "----\n"
    print(text)

if __name__ == "__main__":
    start_time = time.time()
    speech2text("/mnt/media/audio/v1/2024-09-02/12487430.mp4a")
    print("--- %s seconds ---" % (time.time() - start_time))
    print("Done!")
```

This is the python3 output

Traceback (most recent call last):
  File "/mnt/bin/fastwhisper.py", line 21, in <module>
    speech2text("/mnt/media/audio/v1/2024-09-02/12487430.mp4a")
  File "/mnt/bin/fastwhisper.py", line 12, in speech2text
    segments, info = model.transcribe(fileName, beam_size=5, language="pt")
  File "/home/rodrigo/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 319, in transcribe
    audio = decode_audio(audio, sampling_rate=sampling_rate)
  File "/home/rodrigo/.local/lib/python3.10/site-packages/faster_whisper/audio.py", line 52, in decode_audio
    for frame in frames:
  File "/home/rodrigo/.local/lib/python3.10/site-packages/faster_whisper/audio.py", line 103, in _resample_frames
    for frame in itertools.chain(frames, [None]):
  File "/home/rodrigo/.local/lib/python3.10/site-packages/faster_whisper/audio.py", line 90, in _group_frames
    for frame in frames:
  File "/home/rodrigo/.local/lib/python3.10/site-packages/faster_whisper/audio.py", line 80, in _ignore_invalid_frames
    yield next(iterator)
  File "av/container/input.pyx", line 207, in decode
  File "av/container/input.pyx", line 166, in demux
  File "av/container/core.pyx", line 286, in av.container.core.Container.err_check
  File "av/error.pyx", line 326, in av.error.err_check
av.error.OSError: [Errno 5] Input/output error: '/mnt/media/audio/v1/2024-09-02/12487430.mp4a'
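The traceback shows the `OSError` escaping from the frame generator chain in `faster_whisper/audio.py`, which discards everything decoded so far. The requested behavior could be sketched as a wrapper that keeps the frames already yielded and turns the read error into a warning. This is an illustration only, not the library's implementation: `frames_until_error` is a hypothetical name, `fake_frames` is a stand-in for a PyAV decode generator, and the sketch assumes `av.error.OSError` is a subclass of the built-in `OSError`.

```python
import warnings


def frames_until_error(frames, errors=(OSError,)):
    """Yield items from `frames`; on a read error, warn and stop instead of raising."""
    iterator = iter(frames)
    count = 0
    while True:
        try:
            frame = next(iterator)
        except StopIteration:
            return
        except errors as exc:
            # Keep what was decoded so far and report the failure as a warning.
            warnings.warn(f"decoding stopped after {count} frames: {exc}")
            return
        count += 1
        yield frame


def fake_frames():
    """Stand-in for a decode generator that fails on a corrupted tail."""
    yield "frame-0"
    yield "frame-1"
    raise OSError(5, "Input/output error")


if __name__ == "__main__":
    print(list(frames_until_error(fake_frames())))
```

A parameter such as the one requested in this issue could then decide whether the decode loop is wrapped this way or left to raise as it does today.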

This is the output of ffmpeg. It shows the Input/output error but still exports the file to /tmp/1.mp3:

ffmpeg -v debug -i /mnt/media/audio/v1/2024-09-02/12487519.mp4a /tmp/1.mp3

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil 56. 70.100 / 56. 70.100
  libavcodec 58.134.100 / 58.134.100
  libavformat 58. 76.100 / 58. 76.100
  libavdevice 58. 13.100 / 58. 13.100
  libavfilter 7.110.100 / 7.110.100
  libswscale 5. 9.100 / 5. 9.100
  libswresample 3. 9.100 / 3. 9.100
  libpostproc 55. 9.100 / 55. 9.100
Splitting the commandline.
Reading option '-v' ... matched as option 'v' (set logging level) with argument 'debug'.
Reading option '-i' ... matched as input url with argument '/mnt/media/audio/v1/2024-09-02/12487519.mp4a'.
Reading option '/tmp/1.mp3' ... matched as output url.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option v (set logging level) with argument debug.
Successfully parsed a group of options.
Parsing a group of options: input url /mnt/media/audio/v1/2024-09-02/12487519.mp4a.
Successfully parsed a group of options.
Opening an input file: /mnt/media/audio/v1/2024-09-02/12487519.mp4a.
[NULL @ 0x561af38fb680] Opening '/mnt/media/audio/v1/2024-09-02/12487519.mp4a' for reading
[file @ 0x561af38fc300] Setting default whitelist 'file,crypto,data'
[aac @ 0x561af38fb680] Format aac probed with size=32768 and score=50
[aac @ 0x561af38fb680] Before avformat_find_stream_info() pos: 229 bytes read:65696 seeks:4 nb_streams:1
[aac @ 0x561af38fb680] All info found
[aac @ 0x561af38fb680] Estimating duration from bitrate, this may be inaccurate
[aac @ 0x561af38fb680] After avformat_find_stream_info() pos: 9852 bytes read:65696 seeks:4 frames:50
Input #0, aac, from '/mnt/media/audio/v1/2024-09-02/12487519.mp4a':
  Duration: 00:03:02.12, bitrate: 66 kb/s
  Stream #0:0, 50, 1/28224000: Audio: aac (LC), 44100 Hz, stereo, fltp, 66 kb/s
Successfully opened the file.
Parsing a group of options: output url /tmp/1.mp3.
Successfully parsed a group of options.
Opening an output file: /tmp/1.mp3.
File '/tmp/1.mp3' already exists. Overwrite? [y/N] Y
[file @ 0x561af392f500] Setting default whitelist 'file,crypto,data'
Successfully opened the file.
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
detected 12 logical cores
[graph_0_in_0_0 @ 0x561af391f580] Setting 'time_base' to value '1/44100'
[graph_0_in_0_0 @ 0x561af391f580] Setting 'sample_rate' to value '44100'
[graph_0_in_0_0 @ 0x561af391f580] Setting 'sample_fmt' to value 'fltp'
[graph_0_in_0_0 @ 0x561af391f580] Setting 'channel_layout' to value '0x3'
[graph_0_in_0_0 @ 0x561af391f580] tb:1/44100 samplefmt:fltp samplerate:44100 chlayout:0x3
[format_out_0_0 @ 0x561af3a002c0] Setting 'sample_fmts' to value 's32p|fltp|s16p'
[format_out_0_0 @ 0x561af3a002c0] Setting 'sample_rates' to value '44100|48000|32000|22050|24000|16000|11025|12000|8000'
[format_out_0_0 @ 0x561af3a002c0] Setting 'channel_layouts' to value '0x4|0x3'
[AVFilterGraph @ 0x561af392fb00] query_formats: 4 queried, 9 merged, 0 already done, 0 delayed
Output #0, mp3, to '/tmp/1.mp3':
  Metadata:
    TSSE : Lavf58.76.100
  Stream #0:0, 0, 1/44100: Audio: mp3, 44100 Hz, stereo, fltp, delay 1105
    Metadata:
      encoder : Lavc58.134.100 libmp3lame
cur_dts is invalid st:0 (0) [init:1 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
    Last message repeated 2 times
/mnt/media/audio/v1/2024-09-02/12487519.mp4a: Input/output error2x
[out_0_0 @ 0x561af3a00a80] EOF on sink link out_0_0:default.
No more output streams to write to, finishing.
[libmp3lame @ 0x561af390dcc0] Trying to remove 815 more samples than there are in the queue
size= 2821kB time=00:03:00.48 bitrate= 128.0kbits/s speed=96.9x
video:0kB audio:2821kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.008759%
Input file #0 (/mnt/media/audio/v1/2024-09-02/12487519.mp4a):
  Input stream #0:0 (audio): 7773 packets read (1509612 bytes); 7773 frames decoded (7959552 samples);
  Total: 7773 packets (1509612 bytes) demuxed
Output file #0 (/tmp/1.mp3):
  Output stream #0:0 (audio): 6910 frames encoded (7959552 samples); 6911 packets muxed (2888515 bytes);
  Total: 6911 packets (2888515 bytes) muxed
7773 frames successfully decoded, 0 decoding errors
[AVIOContext @ 0x561af3932dc0] Statistics: 2 seeks, 13 writeouts
[AVIOContext @ 0x561af3904680] Statistics: 1542770 bytes read, 4 seeks

This is the output for whisper:

whisper --verbose True --language pt --model large /mnt/media/audio/v1/2024-09-02/12487519.mp4a

/home/rodrigo/.local/lib/python3.10/site-packages/whisper/__init__.py:146: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(fp, map_location=device)
[00:00.000 --> 00:28.000] Música
[00:30.000 --> 00:40.000] Música
[01:00.000 --> 01:10.000] Música
[01:10.000 --> 01:14.000] Música
[01:30.000 --> 01:40.000] Música
[01:40.000 --> 01:46.000] Música
[02:00.000 --> 02:21.040] Super Mix, quarenta minutos de música na Mix.
[02:30.000 --> 02:59.980] Super Mix, quarenta minutos de música na Mix.
[03:00.000 --> 03:00.460] Música