clamsproject / app-whisper-wrapper

app is unexpectedly slow #24

Open keighrim opened 2 months ago

keighrim commented 2 months ago

Bug Description

Originally reported by @owencking via Slack


I was scripting app-whisper-wrapper to run in batches. It is unexpectedly slow.

Reproduction steps

This takes roughly 30 minutes:

$ time docker run -v /mnt/c/Users/owen_king/batchmedia/shipment/:/data -v /mnt/c/Users/owen_king/kitchen/stovetop/clams_whisper_batch_test/mmif/:/mmif -i --rm ghcr.io/clamsproject/app-whisper-wrapper:v8 python cli.py --pretty true --modelSize tiny --modelLang en-US --runningTime true --hwFetch true /mmif/cpb-aacip-002d7648e70_CWhsprTest01_0.mmif /mmif/cpb-aacip-002d7648e70_CWhsprTest01_1-b.mmif
100%|█████████████████████████████████████| 72.1M/72.1M [00:01<00:00, 38.9MiB/s]

real    30m34.649s
user    0m0.020s
sys     0m0.036s

For comparison, this takes roughly 2-3 minutes:

$ time whisper --fp16 False --language en --model tiny cpb-aacip-002d7648e70.mp4 > /dev/null

real    2m18.165s
user    19m47.477s
sys     0m22.580s

In both cases, I'm running on my GBH laptop using WSL. Any idea what might explain the difference?

For comparison, with the new app-distil-whisper-wrapper this took less than 5 minutes. That's using "small" instead of "tiny".

$ time docker run -v /mnt/c/Users/owen_king/batchmedia/shipment/:/data -v /mnt/c/Users/owen_king/kitchen/stovetop/clams_whisper_batch_test/mmif/:/mmif -i --rm  ghcr.io/clamsproject/app-distil-whisper-wrapper:v1.2 python cli.py --pretty true --modelSize small --runningTime true --hwFetch true /mmif/cpb-aacip-002d7648e70_CWhsprTest01_0.mmif /mmif/cpb-aacip-002d7648e70_CWhsprTest01_1-distil.mmif

ffmpeg version 5.1.5-0+deb12u1 Copyright (c) 2000-2024 the FFmpeg developers
  built with gcc 12 (Debian 12.2.0-14)
  configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared
  libavutil      57. 28.100 / 57. 28.100
  libavcodec     59. 37.100 / 59. 37.100
  libavformat    59. 27.100 / 59. 27.100
  libavdevice    59.  7.100 / 59.  7.100
  libavfilter     8. 44.100 /  8. 44.100
  libswscale      6.  7.100 /  6.  7.100
  libswresample   4.  7.100 /  4.  7.100
  libpostproc    56.  6.100 / 56.  6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/data/cpb-aacip-002d7648e70.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    creation_time   : 2020-09-04T11:00:08.000000Z
    encoder         : Lavf58.20.100
    comment         : 060a2b340101010501010d4313000000f3a11300316905800800460201300f20
  Duration: 00:33:30.06, start: 0.000000, bitrate: 695 kb/s
  Stream #0:0[0x1](eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 640x360 [SAR 1:1 DAR 16:9], 582 kb/s, 29.97 fps, 29.97 tbr, 11988 tbn (default)
    Metadata:
      creation_time   : 2020-09-04T11:00:08.000000Z
      handler_name    : Apple Video Media Handler
      vendor_id       : [0][0][0][0]
      timecode        : 00:59:08;25
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 103 kb/s (default)
    Metadata:
      creation_time   : 2020-09-04T11:00:08.000000Z
      handler_name    : Apple Sound Media Handler
      vendor_id       : [0][0][0][0]
  Stream #0:2[0x3](eng): Data: none (tmcd / 0x64636D74)
    Metadata:
      creation_time   : 2020-09-04T11:00:08.000000Z
      handler_name    : Apple Video Media Handler
      timecode        : 00:59:08;25
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/tmp/tmplhxm1_ju/m1_16kHz.wav':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    ICMT            : 060a2b340101010501010d4313000000f3a11300316905800800460201300f20
    ISFT            : Lavf59.27.100
  Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s (default)
    Metadata:
      creation_time   : 2020-09-04T11:00:08.000000Z
      handler_name    : Apple Sound Media Handler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc59.37.100 pcm_s16le
size=   62814kB time=00:33:30.04 bitrate= 256.0kbits/s speed= 854x
video:0kB audio:62814kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000236%
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.

real    4m35.921s
user    0m0.043s
sys     0m0.028s

Expected behavior

No response

Log output

No response

Screenshots

No response

Additional context

No response

keighrim commented 2 months ago

I can't reproduce this problem while running the same video asset on my laptop. I tried for-loops with app.py (http) and cli.py (cli) to run whisper three times each, and the running times are consistently and significantly shorter than what's reported.

$ for m in *.mmif; do echo $m;  cat $m | jq '. | .documents[0].properties.location + " app-version:" + .views[0].metadata.app + " runtime:" +  .views[0].metadata.appRunningTime'; done
cli.1.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:04:08.126495"
cli.2.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:05:52.176937"
cli.3.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:05:56.050787"
http.1.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:38.485740"
http.2.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:40.994840"
http.3.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:46.344477"

(running times of the cli.*.mmif files are slower than http due to other background processes)
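
For reference, the HTTP-mode loop can be reconstructed roughly like this (hypothetical Python, not the actual shell loop used; it assumes the standard CLAMS convention of POSTing MMIF to the app's root endpoint on its default port 5000, with app parameters as a query string):

import requests

# input MMIF file name is illustrative
mmif = open("cpb-aacip-002d7648e70.mmif").read()
for i in (1, 2, 3):
    # runningTime=true makes the app record appRunningTime in the view metadata
    resp = requests.post("http://localhost:5000/",
                         params={"pretty": "true", "runningTime": "true"},
                         data=mmif)
    with open(f"http.{i}.mmif", "w") as out:
        out.write(resp.text)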

mmifs.zip

owencking commented 2 months ago

cpb-aacip-dummy1_WhisperDummyTEST_0.mmif.json

This problem continues for me.

Running on an EC2 g5.2xlarge instance, I ran this command:

$ time docker run -v /wgbh/aapb_media/unrestricted/:/data -v /wgbh/aapb_data/batch_testing/mmif/:/mmif -i --rm ghcr.io/clamsproject/app-whisper-wrapper:v10 python cli.py --pretty true --modelSize tiny --modelLang en-US --task transcribe --initialPrompt " " --noSpeechThreshold 0.4 --runningTime true --hwFetch true -- /mmif/cpb-aacip-dummy1_WhisperDummyTEST_0.mmif /mmif/cpb-aacip-dummy1_WhisperDummyTEST_1.mmif

yielding:

real    2m49.315s
user    0m0.015s
sys     0m0.019s

In contrast, I ran this:

$ time whisper --model tiny --language en --task transcribe --initial_prompt " " --no_speech_threshold 0.4 --word_timestamps True cpb-aacip-dummy1.mp4

yielding:

real    1m4.921s
user    1m4.637s
sys     0m3.549s

Media file

For both runs, the CLAMS app downloaded the "tiny" model, which is only 72MB and downloads in less than 2s.

The CLAMS app takes about 2.5x the time, compared to the standard CLI version of Whisper.

keighrim commented 1 month ago

Using a much shorter test file,

 $ ffmpeg -i /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3 2>&1 | grep Duration 
  Duration: 00:03:11.62, start: 0.023021, bitrate: 192 kb/s

I can observe significantly more time consumed when time-stamping words:

$ time whisper --fp16 False --language en --model tiny   /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3  > /dev/null

real    0m16.040s
user    1m50.845s
sys     0m5.132s

$ time whisper --fp16 False --language en --model tiny --word_timestamps True  /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3  > /dev/null

real    2m24.231s
user    30m49.201s
sys     0m21.988s

Then, with the aforementioned (long) file,

$ time whisper --fp16 False --language en --model tiny   ~/cpb-aacip-002d7648e70.mp4  > /dev/null

real    2m40.821s
user    20m12.450s
sys     0m41.715s

$ time whisper --fp16 False --language en --model tiny  --word_timestamps True  ~/cpb-aacip-002d7648e70.mp4  > /dev/null

real    17m28.261s
user    232m24.400s
sys     2m48.780s

The CLAMS whisper wrapper always uses word_timestamps=True because we believe fine-grained timestamps are very important for any sort of downstream task, hence this might be the cause of this issue.
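
For context, this is what word-level timestamping looks like at the openai-whisper API level (a minimal sketch mirroring the wrapper's behavior, not the wrapper's actual code):

import whisper

model = whisper.load_model("tiny")
# word_timestamps=True triggers per-token cross-attention alignment,
# which is what dominates the CPU runtimes measured above
result = model.transcribe(
    "cpb-aacip-002d7648e70.mp4",
    language="en",
    fp16=False,
    word_timestamps=True,
)
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:8.2f} {word["end"]:8.2f} {word["word"]}')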

keighrim commented 1 month ago

A possible "fix" could be to expose a parameter that turns off word-level timestamps to get transcripts as fast as possible, then post-process subsets of the output MMIFs (on demand) with a forced-alignment app that is sensitive to and capable of processing TimeFrame--Alignment--TextDocument triples (such an app doesn't exist currently).
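
For illustration only, such a parameter could be declared through the clams-python AppMetadata API along these lines (all names and field values here are hypothetical, not the app's actual metadata):

from clams.appmetadata import AppMetadata

# hypothetical metadata sketch for the app's metadata.py
metadata = AppMetadata(
    name="Whisper Wrapper",
    description="A CLAMS wrapper around OpenAI Whisper.",
    app_license="Apache 2.0",
    identifier="whisper-wrapper",
    url="https://github.com/clamsproject/app-whisper-wrapper",
)
metadata.add_parameter(
    name="wordTimestamps",  # hypothetical parameter, not in the app today
    description="Generate word-level timestamps (much slower on CPU)",
    type="boolean",
    default=True,
)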

owencking commented 1 month ago

I'm seeing only a minor difference in runtime depending on the presence of the --word_timestamps flag. Here are the results of running the tiny model with timestamps on vs. timestamps off:

69s vs 67s
63s vs 58s
72s vs 67s

See results in this file.

Here are the actual commands I'm using to run these tests:

# word timestamps ON
model=tiny
op="/wgbh/owen_whisper_cli_tests/test_tiny_wts/"
shopt -s nullglob
time for f in *.{mp3,mp4,mov,wav} ; do { time whisper --language en --model $model --no_speech_threshold 0.4 --word_timestamps True -o $op $f ; } 2> "$op"/$f.time ; done

# word timestamps OFF
model=tiny
op="/wgbh/owen_whisper_cli_tests/test_tiny_no-wts/"
shopt -s nullglob
time for f in *.{mp3,mp4,mov,wav} ; do { time whisper --language en --model $model --no_speech_threshold 0.4 --word_timestamps False -o $op $f ; } 2> "$op"/$f.time ; done

keighrim commented 1 month ago

On this system

$ fastfetch --logo none
...
OS: Red Hat Enterprise Linux 9.3 x86_64
Host: PowerEdge R820
Kernel: Linux 5.14.0-362.18.1.el9_3.x86_64
Uptime: 229 days(!), 23 hours, 54 mins
Packages: 1059 (rpm)
Shell: bash 5.1.8
CPU: Intel(R) Xeon(R) E5-4620 0 (64) @ 2.60 GHz
GPU: Matrox Electronics Systems Ltd. G200eR2
Memory: 33.20 GiB / 125.29 GiB (26%)
Swap: 43.83 MiB / 32.00 GiB (0%)

$ cat /proc/cpuinfo | grep flag | head -n1
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d

I tried the short file (mentioned above) with 4 different invocations:

  1. whisper command without --word_timestamps
  2. whisper command with --word_timestamps
  3. cli.py without docker
  4. app.py with docker

All models were pre-downloaded, and I ran each invocation three times; here's the runtime measured by the time command (real values):

cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
    tiny + whisper-noTS took
        67.77033333333333 seconds in average (['1m4.455s', '1m20.657s', '0m58.199s'])
    tiny + whisper-TS took
        366.452 seconds in average (['6m12.131s', '5m59.743s', '6m7.482s'])
    tiny + cli-TS took
        295.10333333333335 seconds in average (['5m12.752s', '4m33.391s', '4m59.167s'])
    tiny + docker-TS took
        414.22166666666664 seconds in average (['6m48.265s', '6m58.198s', '6m56.202s'])
    large + whisper-noTS took
        4932.864333333334 seconds in average (['78m6.036s', '79m8.295s', '89m24.262s'])
    large + whisper-TS took
        14649.093 seconds in average (['209m48.613s', '217m27.607s', '305m11.059s'])
    large + cli-TS took
        636.348 seconds in average (['10m38.453s', '10m32.948s', '10m37.643s'])
    large + docker-TS took
        636.1976666666666 seconds in average (['10m33.827s', '10m41.223s', '10m33.543s'])

I'm running the same loop on other files (one more "short" one and two "long" files) and will report more numbers once it has all finished (might take 1-2 days).
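
(The averages reported here are plain means over the three "real" values; a small helper like the following reproduces them, assuming the m/s format printed by time:)

def to_seconds(t: str) -> float:
    # parse a time(1)-style "real" value such as "6m48.265s"
    minutes, _, rest = t.partition("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))

runs = ["6m48.265s", "6m58.198s", "6m56.202s"]  # tiny + docker-TS
print(sum(to_seconds(t) for t in runs) / len(runs))  # ~414.2217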

keighrim commented 1 month ago

While waiting for the whisper(s) to complete transcription, I ran the same script on a machine with powerful GPUs.

$ fastfetch --logo none
...
OS: Red Hat Enterprise Linux 9.4 x86_64
Host: ESC4000A-E11
Kernel: 5.14.0-427.22.1.el9_4.x86_64
Uptime: 82 days, 18 hours, 52 mins
Packages: 1156 (rpm)
CPU: AMD EPYC 7543 32-Core (64) @ 3.73789 GHz
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: ASPEED Technology, Inc. ASPEED Graphics Family
Memory: 5.52 GiB / 251 GiB (2%)

And here are the numbers (non-GPU numbers in the top half of the following are from the previous experiment and should be identical to those in the comment above).

cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
    tiny + whisper-noTS took
        67.77033333333333 seconds in average (['0m58.199s', '1m4.455s', '1m20.657s'])
    tiny + whisper-TS took
        366.452 seconds in average (['5m59.743s', '6m12.131s', '6m7.482s'])
    tiny + cli-TS took
        295.10333333333335 seconds in average (['5m12.752s', '4m33.391s', '4m59.167s'])
    tiny + docker-TS took
        414.22166666666664 seconds in average (['6m48.265s', '6m56.202s', '6m58.198s'])
    large + whisper-noTS took
        4932.864333333334 seconds in average (['79m8.295s', '78m6.036s', '89m24.262s'])
    large + whisper-TS took
        14649.093 seconds in average (['209m48.613s', '217m27.607s', '305m11.059s'])
    large + cli-TS took
        636.348 seconds in average (['10m32.948s', '10m38.453s', '10m37.643s'])
    large + docker-TS took
        636.1976666666666 seconds in average (['10m33.827s', '10m33.543s', '10m41.223s'])
    tiny + whisper-noTS + GPU took
        11.183666666666667 seconds in average (['0m11.095s', '0m11.027s', '0m11.429s'])
    tiny + whisper-TS + GPU took
        11.886666666666665 seconds in average (['0m11.917s', '0m11.882s', '0m11.861s'])
    tiny + cli-TS + GPU took
        12.519666666666666 seconds in average (['0m12.916s', '0m12.326s', '0m12.317s'])
    tiny + docker-TS + GPU took
        4.543666666666667 seconds in average (['0m4.495s', '0m4.695s', '0m4.441s'])
    large + whisper-noTS + GPU took
        85.21466666666667 seconds in average (['1m26.239s', '1m26.280s', '1m23.125s'])
    large + whisper-TS + GPU took
        84.09533333333333 seconds in average (['1m28.422s', '1m24.722s', '1m19.142s'])
    large + cli-TS + GPU took
        26.631333333333334 seconds in average (['0m26.210s', '0m26.668s', '0m27.016s'])
    large + docker-TS + GPU took
        6.2459999999999996 seconds in average (['0m6.074s', '0m6.105s', '0m6.559s'])

I'm seeing that the effect of word-level timestamping is gone with a GPU, and for some reason, the dockerized HTTP mode of the CLAMS app runs many-fold faster than the vanilla whisper command.

owencking commented 1 month ago

dockerized HTTP mode of the clams app runs many-fold faster than the vanilla whisper command

I noticed that. In fact, that seems to be true on both the GPU system and the less powerful system. It really doesn't make any sense. I think it must be a bug! Maybe some parameter isn't getting passed?

owencking commented 1 month ago

These are really helpful figures, Keigh! Awesome. In some ways, they raise more questions than they answer. I'm looking forward to discussing them on Monday.

If you have time, could you please try adding cli.py with docker to the test (for both tiny and large)?

I ask because that's the configuration that's most relevant to GBH -- the one we're most likely to use in production. (It's also the configuration for which I've observed slowness.)

keighrim commented 1 month ago

After the discussion yesterday, I ran some additional tests and found:

  1. running app.py --production (which puts the app behind a gunicorn server, the default mode for the containerized app) spawns a new python3 process for each request, resulting in duplicate models loaded in memory and eventually causing OOM errors.
  2. there are some additional decode_options that are passed to the transcribe function when running the whisper command, but not when running the CLAMS wrapper.

So to address the first problem, I simply ran the containers without the --production flag. To address the second problem, I added some additional default options to the transcribe function in app.py, taken from the default arguments of the whisper command. (I think the options that play a significant role are beam_size and best_of being 5; without setting them via CLI arguments, the search size is automatically set to 1.) Let's call this new app.py version 11.
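
In whisper's Python API terms, the fix boils down to forwarding the CLI's default search options to transcribe(); roughly like this (a sketch of the idea, not the literal change in app.py):

import whisper

model = whisper.load_model("large-v3")
# the whisper CLI defaults to beam_size=5 and best_of=5; when these are
# left unset in the Python API, decoding falls back to a search size of 1
result = model.transcribe(
    "/data/cpb-aacip-002d7648e70.mp4",
    language="en",
    word_timestamps=True,
    beam_size=5,
    best_of=5,
)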

Here are more findings from the runtime reports under the GPU environment after fixing the above issues:

  1. unlike the CPU-only experiments, when running the whisper command, adding word-level timestamps adds only insignificant extra time:
    cpb-aacip-002d7648e70 : duration=00:33:30.06
        large-v3 + whisper-noTS + GPU took
            366.4783333333333 seconds in average (['6m21.878s', '5m49.776s', '6m7.781s'])
        large-v3 + whisper-TS + GPU took
            374.04400000000004 seconds in average (['7m11.541s', '6m10.089s', '5m20.502s'])
    cpb-aacip-28eaf84d060 : duration=00:30:59.02
        large-v3 + whisper-noTS + GPU took
            527.2203333333333 seconds in average (['9m3.110s', '8m47.737s', '8m30.814s'])
        large-v3 + whisper-TS + GPU took
            515.8396666666666 seconds in average (['8m43.458s', '9m5.316s', '7m58.745s'])
    cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
        large-v3 + whisper-noTS + GPU took
            79.96833333333335 seconds in average (['1m19.341s', '1m26.781s', '1m13.783s'])
        large-v3 + whisper-TS + GPU took
            86.51033333333334 seconds in average (['1m32.932s', '1m26.601s', '1m19.998s'])
    cpb-aacip-528-h12v40m45x : duration=00:01:47.76
        large-v3 + whisper-noTS + GPU took
            25.640666666666664 seconds in average (['0m25.420s', '0m25.078s', '0m26.424s'])
        large-v3 + whisper-TS + GPU took
            28.659999999999997 seconds in average (['0m27.770s', '0m29.493s', '0m28.717s'])
  2. setting the search size to 5 tends to add more time:
    cpb-aacip-002d7648e70 : duration=00:33:30.06
        large-v3 + clipy-TS + GPU took
            419.63866666666667 seconds in average (['7m44.655s', '6m32.399s', '6m41.862s'])
        large-v3 + clipy-11-TS + GPU took
            434.7486666666667 seconds in average (['7m51.646s', '7m37.290s', '6m15.310s'])
        large-v3 + clipy-docker-TS + GPU took
            400.096 seconds in average (['7m47.500s', '5m3.098s', '7m9.690s'])
        large-v3 + clipy-11-docker-TS + GPU took
            420.49066666666675 seconds in average (['8m16.495s', '6m57.665s', '5m47.312s'])
    cpb-aacip-28eaf84d060 : duration=00:30:59.02
        large-v3 + clipy-TS + GPU took
            359.358 seconds in average (['7m9.318s', '5m9.989s', '5m38.767s'])
        large-v3 + clipy-11-TS + GPU took
            492.19233333333335 seconds in average (['7m22.212s', '8m52.446s', '8m21.919s'])
        large-v3 + clipy-docker-TS + GPU took
            399.6783333333333 seconds in average (['7m12.169s', '6m5.360s', '6m41.506s'])
        large-v3 + clipy-11-docker-TS + GPU took
            578.5519999999999 seconds in average (['9m42.632s', '10m55.992s', '8m17.032s'])
    cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
        large-v3 + clipy-TS + GPU took
            26.94066666666667 seconds in average (['0m26.204s', '0m29.027s', '0m25.591s'])
        large-v3 + clipy-11-TS + GPU took
            81.77533333333334 seconds in average (['1m25.978s', '1m18.532s', '1m20.816s'])
        large-v3 + clipy-docker-TS + GPU took
            30.127666666666666 seconds in average (['0m32.336s', '0m29.940s', '0m28.107s'])
        large-v3 + clipy-11-docker-TS + GPU took
            81.04266666666668 seconds in average (['1m18.848s', '1m24.591s', '1m19.689s'])
    cpb-aacip-528-h12v40m45x : duration=00:01:47.76
        large-v3 + clipy-TS + GPU took
            27.816999999999997 seconds in average (['0m27.728s', '0m28.215s', '0m27.508s'])
        large-v3 + clipy-11-TS + GPU took
            23.140666666666664 seconds in average (['0m23.386s', '0m22.100s', '0m23.936s'])
        large-v3 + clipy-docker-TS + GPU took
            29.872666666666664 seconds in average (['0m28.712s', '0m30.191s', '0m30.715s'])
        large-v3 + clipy-11-docker-TS + GPU took
            26.674666666666667 seconds in average (['0m27.601s', '0m27.098s', '0m25.325s'])
  3. the longer running time with the larger search size is actually closer to the vanilla whisper runs:
    cpb-aacip-002d7648e70 : duration=00:33:30.06
        large-v3 + whisper-TS + GPU took
            374.04400000000004 seconds in average (['7m11.541s', '6m10.089s', '5m20.502s'])
        large-v3 + clipy-docker-TS + GPU took
            400.096 seconds in average (['7m47.500s', '5m3.098s', '7m9.690s'])
        large-v3 + clipy-11-docker-TS + GPU took
            420.49066666666675 seconds in average (['8m16.495s', '6m57.665s', '5m47.312s'])
    cpb-aacip-28eaf84d060 : duration=00:30:59.02
        large-v3 + whisper-TS + GPU took
            515.8396666666666 seconds in average (['8m43.458s', '9m5.316s', '7m58.745s'])
        large-v3 + clipy-docker-TS + GPU took
            399.6783333333333 seconds in average (['7m12.169s', '6m5.360s', '6m41.506s'])
        large-v3 + clipy-11-docker-TS + GPU took
            578.5519999999999 seconds in average (['9m42.632s', '10m55.992s', '8m17.032s'])
    cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
        large-v3 + whisper-TS + GPU took
            86.51033333333334 seconds in average (['1m32.932s', '1m26.601s', '1m19.998s'])
        large-v3 + clipy-docker-TS + GPU took
            30.127666666666666 seconds in average (['0m32.336s', '0m29.940s', '0m28.107s'])
        large-v3 + clipy-11-docker-TS + GPU took
            81.04266666666668 seconds in average (['1m18.848s', '1m24.591s', '1m19.689s'])
    cpb-aacip-528-h12v40m45x : duration=00:01:47.76
        large-v3 + whisper-TS + GPU took
            28.659999999999997 seconds in average (['0m27.770s', '0m29.493s', '0m28.717s'])
        large-v3 + clipy-docker-TS + GPU took
            29.872666666666664 seconds in average (['0m28.712s', '0m30.191s', '0m30.715s'])
        large-v3 + clipy-11-docker-TS + GPU took
            26.674666666666667 seconds in average (['0m27.601s', '0m27.098s', '0m25.325s'])
  4. just by rough eyeballing, transcripts with the larger search size seem to be of better quality:

    $ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-whisper-TS.out | head
    [00:00.000 --> 00:00.200]  you
    [00:30.000 --> 00:33.640]  Thank you.
    [01:09.340 --> 01:10.300]  Thank you.
    [01:32.640 --> 01:34.640]  This week on Warriors and Company.
    [01:35.500 --> 01:52.360]  Is this really our model for the Middle East that we are going to bomb countries continuously, take part in civil wars, sometimes supporting one side, maybe supporting the other, with no means or no real desire or effort to achieve a peace?
    [01:52.600 --> 01:58.200]  As much as President Obama wishes we weren't the world's policemen, perhaps we are.
    [01:58.340 --> 01:59.980]  And there's no escaping that.
    [02:00.000 --> 02:00.560]  That curse.
    [02:01.280 --> 02:02.880]  Funding is provided by
    [02:02.880 --> 02:06.340]  Ann Gumowitz, encouraging the renewal of democracy.
    
    $ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-clipy-TS.out | grep '"text"'  | head
                "text": {
                "text": "you",
                "text": "you",
                "text": "you",
                "text": "you",
                "text": "you",
                "text": "this week on lawyers and company",
                "text": "is this really",
                "text": "our model for the middle east that we are going",
                "text": "to bomb countries continuously",
    
    $ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-clipy-11-TS.out | grep '"text"'  | head
                "text": {
                "text": "you",
                "text": "Thank you.",
                "text": "Thank you.",
                "text": "This week on Warriors and Company.",
                "text": "Is this really our model for the Middle East that we are going to bomb countries continuously, take part in civil wars, sometimes supporting one side, maybe supporting the other, with no means or no real desire or effort to achieve a peace?",
                "text": "As much as President Obama wishes we weren't the world's policemen, perhaps we are.",
                "text": "And there's no escaping that.",
                "text": "That curse.",
                "text": "Funding is provided by",