keighrim opened this issue 2 months ago
I can't reproduce this problem while running the same video asset on my laptop. I tried for-loops with app.py
(http) and cli.py
(cli) to run whisper three times each, and the running time is consistently and significantly shorter than what's reported.
$ for m in *.mmif; do echo $m; cat $m | jq '. | .documents[0].properties.location + " app-version:" + .views[0].metadata.app + " runtime:" + .views[0].metadata.appRunningTime'; done
cli.1.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:04:08.126495"
cli.2.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:05:52.176937"
cli.3.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:05:56.050787"
http.1.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:38.485740"
http.2.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:40.994840"
http.3.mmif
"file:///home/krim/cpb-aacip-002d7648e70.mp4 app-version:http://apps.clams.ai/whisper-wrapper/v10 runtime:0:01:46.344477"
(The running times of the cli.*.mmif files are longer than those of the http runs due to other background processes.)
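For reference, here is a minimal sketch of the kind of repeat-run loop used for the HTTP mode, assuming the app is already serving on localhost:5000; the port, file names, and the use of the requests library are illustrative, not taken from the actual setup.

import json
import requests

# Sketch only: POST the same input MMIF three times and read back the
# runtime that the wrapper records in the view metadata.
mmif_text = open("input.mmif").read()
for i in (1, 2, 3):
    resp = requests.post("http://localhost:5000", data=mmif_text)
    view = json.loads(resp.text)["views"][0]
    print(i, view["metadata"]["app"], view["metadata"].get("appRunningTime"))
    with open(f"http.{i}.mmif", "w") as f:
        f.write(resp.text)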
cpb-aacip-dummy1_WhisperDummyTEST_0.mmif.json
This problem continues for me.
Running on an EC2 g5.2xlarge instance, I ran this command:
$ time docker run -v /wgbh/aapb_media/unrestricted/:/data -v /wgbh/aapb_data/batch_testing/mmif/:/mmif -i --rm ghcr.io/clamsproject/app-whisper-wrapper:v10 python cli.py --pretty true --modelSize tiny --modelLang en-US --task transcribe --initialPrompt " " --noSpeechThreshold 0.4 --runningTime true --hwFetch true -- /mmif/cpb-aacip-dummy1_WhisperDummyTEST_0.mmif /mmif/cpb-aacip-dummy1_WhisperDummyTEST_1.mmif
yielding:
real 2m49.315s
user 0m0.015s
sys 0m0.019s
In contrast, I ran this:
$ time whisper --model tiny --language en --task transcribe --initial_prompt " " --no_speech_threshold 0.4 --word_timestamps True cpb-aacip-dummy1.mp4
yielding:
real 1m4.921s
user 1m4.637s
sys 0m3.549s
For both runs, the "tiny" model was downloaded, which is only 72MB and downloads in less than 2s.
The CLAMS app takes about 2.5x the time, compared to the standard CLI version of Whisper.
Using a much shorter test file,
$ ffmpeg -i /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3 2>&1 | grep Duration
Duration: 00:03:11.62, start: 0.023021, bitrate: 192 kb/s
I can observe significantly more time consumed when time-stamping words:
$ time whisper --fp16 False --language en --model tiny /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3 > /dev/null
real 0m16.040s
user 1m50.845s
sys 0m5.132s
$ time whisper --fp16 False --language en --model tiny --word_timestamps True /home/krim/working-data/aes-data/wgbh-samples2-less/cpb-aacip-528-0g3gx45t80.mp3 > /dev/null
real 2m24.231s
user 30m49.201s
sys 0m21.988s
Then, with the aforementioned (long) file,
$ time whisper --fp16 False --language en --model tiny ~/cpb-aacip-002d7648e70.mp4 > /dev/null
real 2m40.821s
user 20m12.450s
sys 0m41.715s
$ time whisper --fp16 False --language en --model tiny --word_timestamps True ~/cpb-aacip-002d7648e70.mp4 > /dev/null
real 17m28.261s
user 232m24.400s
sys 2m48.780s
The CLAMS whisper wrapper always uses word_timestamps=True because we believe fine-grained timestamps are very important for any sort of downstream task, so this might be the cause of this issue.
A possible "fix" would be to expose a parameter that turns off word-level timestamps, so transcripts can be produced as fast as possible, and then post-process subsets of the output MMIFs (on demand) with a forced-alignment app that is sensitive to, and capable of processing, TimeFrame / Alignment / TextDocument triples (such an app doesn't exist currently).
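A minimal sketch of what that escape hatch could look like, assuming a hypothetical wordTimestamps parameter (today the wrapper hard-codes word_timestamps=True):

import whisper

# Sketch only: expose word-level timestamping as an option so callers can
# trade word-level anchors for speed; a forced-alignment app could restore
# the word timing downstream on demand.
def transcribe(audio_path, model_size="tiny", word_timestamps=True):
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path, word_timestamps=word_timestamps)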
I'm seeing only a minor difference in runtime that depends on the presence of the --word_timestamps
flag. Here are the results running the tiny model with timestamps on vs. timestamps off:
69s vs 67s, 63s vs 58s, 72s vs 67s
See results in this file.
Here are the actual commands I'm using to run these tests:
model=tiny
op="/wgbh/owen_whisper_cli_tests/test_tiny_wts/"
shopt -s nullglob
time for f in *.{mp3,mp4,mov,wav} ; do { time whisper --language en --model $model --no_speech_threshold 0.4 --word_timestamps True -o $op $f ; } 2> "$op"/$f.time ; done
model=tiny
op="/wgbh/owen_whisper_cli_tests/test_tiny_no-wts/"
shopt -s nullglob
time for f in *.{mp3,mp4,mov,wav} ; do { time whisper --language en --model $model --no_speech_threshold 0.4 --word_timestamps False -o $op $f ; } 2> "$op"/$f.time ; done
On this system
$ fastfetch --logo none
...
OS: Red Hat Enterprise Linux 9.3 x86_64
Host: PowerEdge R820
Kernel: Linux 5.14.0-362.18.1.el9_3.x86_64
Uptime: 229 days(!), 23 hours, 54 mins
Packages: 1059 (rpm)
Shell: bash 5.1.8
CPU: Intel(R) Xeon(R) E5-4620 0 (64) @ 2.60 GHz
GPU: Matrox Electronics Systems Ltd. G200eR2
Memory: 33.20 GiB / 125.29 GiB (26%)
Swap: 43.83 MiB / 32.00 GiB (0%)
$ cat /proc/cpuinfo | grep flag | head -n1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
I tried the short file (mentioned above) with 4 different invocations:
- whisper command without --word_timestamps
- whisper command with --word_timestamps
- cli.py without docker
- app.py with docker
All models were pre-downloaded, and I ran each invocation three times; here's the runtime measured by the time command (real values).
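(The averages below were computed from those real values; a rough sketch of the conversion, not the actual reporting script:)

import re

# Convert bash `time` output in MmS.SSSs format to seconds and average.
def to_seconds(t):
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", t).groups()
    return int(minutes) * 60 + float(seconds)

reals = ["1m4.455s", "1m20.657s", "0m58.199s"]  # tiny + whisper-noTS runs
print(sum(to_seconds(t) for t in reals) / len(reals))  # ~67.77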
cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
tiny + whisper-noTS took
67.77033333333333 seconds in average (['1m4.455s', '1m20.657s', '0m58.199s'])
tiny + whisper-TS took
366.452 seconds in average (['6m12.131s', '5m59.743s', '6m7.482s'])
tiny + cli-TS took
295.10333333333335 seconds in average (['5m12.752s', '4m33.391s', '4m59.167s'])
tiny + docker-TS took
414.22166666666664 seconds in average (['6m48.265s', '6m58.198s', '6m56.202s'])
large + whisper-noTS took
4932.864333333334 seconds in average (['78m6.036s', '79m8.295s', '89m24.262s'])
large + whisper-TS took
14649.093 seconds in average (['209m48.613s', '217m27.607s', '305m11.059s'])
large + cli-TS took
636.348 seconds in average (['10m38.453s', '10m32.948s', '10m37.643s'])
large + docker-TS took
636.1976666666666 seconds in average (['10m33.827s', '10m41.223s', '10m33.543s'])
I'm running the same loop on other files (one more "short" one and two "long" files) and will report more numbers once it's all finished (might take 1-2 days).
While waiting for the whisper(s) to complete transcription, I ran the same script on a machine with powerful GPUs.
$ fastfetch --logo none
...
OS: Red Hat Enterprise Linux 9.4 x86_64
Host: ESC4000A-E11
Kernel: 5.14.0-427.22.1.el9_4.x86_64
Uptime: 82 days, 18 hours, 52 mins
Packages: 1156 (rpm)
CPU: AMD EPYC 7543 32-Core (64) @ 3.73789 GHz
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: ASPEED Technology, Inc. ASPEED Graphics Family
Memory: 5.52 GiB / 251 GiB (2%)
And here are the numbers (the non-GPU numbers in the top half are from the previous experiment and should be identical to those in the comment above).
cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
tiny + whisper-noTS took
67.77033333333333 seconds in average (['0m58.199s', '1m4.455s', '1m20.657s'])
tiny + whisper-TS took
366.452 seconds in average (['5m59.743s', '6m12.131s', '6m7.482s'])
tiny + cli-TS took
295.10333333333335 seconds in average (['5m12.752s', '4m33.391s', '4m59.167s'])
tiny + docker-TS took
414.22166666666664 seconds in average (['6m48.265s', '6m56.202s', '6m58.198s'])
large + whisper-noTS took
4932.864333333334 seconds in average (['79m8.295s', '78m6.036s', '89m24.262s'])
large + whisper-TS took
14649.093 seconds in average (['209m48.613s', '217m27.607s', '305m11.059s'])
large + cli-TS took
636.348 seconds in average (['10m32.948s', '10m38.453s', '10m37.643s'])
large + docker-TS took
636.1976666666666 seconds in average (['10m33.827s', '10m33.543s', '10m41.223s'])
tiny + whisper-noTS + GPU took
11.183666666666667 seconds in average (['0m11.095s', '0m11.027s', '0m11.429s'])
tiny + whisper-TS + GPU took
11.886666666666665 seconds in average (['0m11.917s', '0m11.882s', '0m11.861s'])
tiny + cli-TS + GPU took
12.519666666666666 seconds in average (['0m12.916s', '0m12.326s', '0m12.317s'])
tiny + docker-TS + GPU took
4.543666666666667 seconds in average (['0m4.495s', '0m4.695s', '0m4.441s'])
large + whisper-noTS + GPU took
85.21466666666667 seconds in average (['1m26.239s', '1m26.280s', '1m23.125s'])
large + whisper-TS + GPU took
84.09533333333333 seconds in average (['1m28.422s', '1m24.722s', '1m19.142s'])
large + cli-TS + GPU took
26.631333333333334 seconds in average (['0m26.210s', '0m26.668s', '0m27.016s'])
large + docker-TS + GPU took
6.2459999999999996 seconds in average (['0m6.074s', '0m6.105s', '0m6.559s'])
I'm seeing that the effect of word-level timestamping is gone with a GPU, and for some reason, the dockerized HTTP mode of the CLAMS app runs many-fold faster than the vanilla whisper command.
"dockerized HTTP mode of the clams app runs many-fold faster than the vanilla whisper command"
I noticed that. In fact, that seems to be true on both the GPU system and the less powerful system. It really doesn't make any sense. I think it must be a bug! Maybe some parameter isn't getting passed?
These are really helpful figures, Keigh! Awesome. In some ways, they raise more questions than they answer. I'm looking forward to discussing on Monday.
If you have time, could you please try adding cli.py with docker to the test (for both tiny and large)?
I ask because that's the configuration that's most relevant to GBH -- the one we're most likely to use in production. (It's also the configuration for which I've observed slowness.)
After the discussion yesterday, I ran some additional tests and found:
- app.py --production (which puts the app on a gunicorn server, the default mode for the containerized app) will spawn a new python3 process for each request, resulting in duplicate models loaded in memory and eventually causing OOM errors.
- decode_options are passed to the transcribe function when running the whisper command, but not when running the CLAMS wrapper.
So to address the first problem, I simply ran the containers without the --production flag. To address the second problem, I added some additional default options to the transcribe call in app.py, taken from the default arguments of the whisper command. (I think the options that play a significant role are beam_size and best_of being 5; without setting them via CLI arguments, the search size is automatically set to 1.) Let's call this new app.py 11.
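Roughly, the change amounts to something like the following sketch (not the exact diff to app.py); the option names and defaults are taken from the whisper CLI:

# Sketch only: fall back to the whisper CLI's default decoding options when
# the user hasn't set them, so the wrapper searches with the same width as
# the `whisper` command (leaving beam_size unset means greedy decoding,
# i.e. an effective search size of 1).
CLI_DEFAULT_DECODE_OPTIONS = {"beam_size": 5, "best_of": 5}

def run_transcribe(model, audio_path, user_options=None):
    options = {**CLI_DEFAULT_DECODE_OPTIONS, **(user_options or {})}
    return model.transcribe(audio_path, word_timestamps=True, **options)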
Here are more findings from the runtime reports under the GPU environment after fixing the above issues:
With the whisper command, adding word-level timestamps tends to take only insignificantly more time:
cpb-aacip-002d7648e70 : duration=00:33:30.06
large-v3 + whisper-noTS + GPU took
366.4783333333333 seconds in average (['6m21.878s', '5m49.776s', '6m7.781s'])
large-v3 + whisper-TS + GPU took
374.04400000000004 seconds in average (['7m11.541s', '6m10.089s', '5m20.502s'])
cpb-aacip-28eaf84d060 : duration=00:30:59.02
large-v3 + whisper-noTS + GPU took
527.2203333333333 seconds in average (['9m3.110s', '8m47.737s', '8m30.814s'])
large-v3 + whisper-TS + GPU took
515.8396666666666 seconds in average (['8m43.458s', '9m5.316s', '7m58.745s'])
cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
large-v3 + whisper-noTS + GPU took
79.96833333333335 seconds in average (['1m19.341s', '1m26.781s', '1m13.783s'])
large-v3 + whisper-TS + GPU took
86.51033333333334 seconds in average (['1m32.932s', '1m26.601s', '1m19.998s'])
cpb-aacip-528-h12v40m45x : duration=00:01:47.76
large-v3 + whisper-noTS + GPU took
25.640666666666664 seconds in average (['0m25.420s', '0m25.078s', '0m26.424s'])
large-v3 + whisper-TS + GPU took
28.659999999999997 seconds in average (['0m27.770s', '0m29.493s', '0m28.717s'])
cpb-aacip-002d7648e70 : duration=00:33:30.06
large-v3 + clipy-TS + GPU took
419.63866666666667 seconds in average (['7m44.655s', '6m32.399s', '6m41.862s'])
large-v3 + clipy-11-TS + GPU took
434.7486666666667 seconds in average (['7m51.646s', '7m37.290s', '6m15.310s'])
large-v3 + clipy-docker-TS + GPU took
400.096 seconds in average (['7m47.500s', '5m3.098s', '7m9.690s'])
large-v3 + clipy-11-docker-TS + GPU took
420.49066666666675 seconds in average (['8m16.495s', '6m57.665s', '5m47.312s'])
cpb-aacip-28eaf84d060 : duration=00:30:59.02
large-v3 + clipy-TS + GPU took
359.358 seconds in average (['7m9.318s', '5m9.989s', '5m38.767s'])
large-v3 + clipy-11-TS + GPU took
492.19233333333335 seconds in average (['7m22.212s', '8m52.446s', '8m21.919s'])
large-v3 + clipy-docker-TS + GPU took
399.6783333333333 seconds in average (['7m12.169s', '6m5.360s', '6m41.506s'])
large-v3 + clipy-11-docker-TS + GPU took
578.5519999999999 seconds in average (['9m42.632s', '10m55.992s', '8m17.032s'])
cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
large-v3 + clipy-TS + GPU took
26.94066666666667 seconds in average (['0m26.204s', '0m29.027s', '0m25.591s'])
large-v3 + clipy-11-TS + GPU took
81.77533333333334 seconds in average (['1m25.978s', '1m18.532s', '1m20.816s'])
large-v3 + clipy-docker-TS + GPU took
30.127666666666666 seconds in average (['0m32.336s', '0m29.940s', '0m28.107s'])
large-v3 + clipy-11-docker-TS + GPU took
81.04266666666668 seconds in average (['1m18.848s', '1m24.591s', '1m19.689s'])
cpb-aacip-528-h12v40m45x : duration=00:01:47.76
large-v3 + clipy-TS + GPU took
27.816999999999997 seconds in average (['0m27.728s', '0m28.215s', '0m27.508s'])
large-v3 + clipy-11-TS + GPU took
23.140666666666664 seconds in average (['0m23.386s', '0m22.100s', '0m23.936s'])
large-v3 + clipy-docker-TS + GPU took
29.872666666666664 seconds in average (['0m28.712s', '0m30.191s', '0m30.715s'])
large-v3 + clipy-11-docker-TS + GPU took
26.674666666666667 seconds in average (['0m27.601s', '0m27.098s', '0m25.325s'])
For comparison with the whisper runs:
cpb-aacip-002d7648e70 : duration=00:33:30.06
large-v3 + whisper-TS + GPU took
374.04400000000004 seconds in average (['7m11.541s', '6m10.089s', '5m20.502s'])
large-v3 + clipy-docker-TS + GPU took
400.096 seconds in average (['7m47.500s', '5m3.098s', '7m9.690s'])
large-v3 + clipy-11-docker-TS + GPU took
420.49066666666675 seconds in average (['8m16.495s', '6m57.665s', '5m47.312s'])
cpb-aacip-28eaf84d060 : duration=00:30:59.02
large-v3 + whisper-TS + GPU took
515.8396666666666 seconds in average (['8m43.458s', '9m5.316s', '7m58.745s'])
large-v3 + clipy-docker-TS + GPU took
399.6783333333333 seconds in average (['7m12.169s', '6m5.360s', '6m41.506s'])
large-v3 + clipy-11-docker-TS + GPU took
578.5519999999999 seconds in average (['9m42.632s', '10m55.992s', '8m17.032s'])
cpb-aacip-528-0g3gx45t80 : duration=00:03:11.62
large-v3 + whisper-TS + GPU took
86.51033333333334 seconds in average (['1m32.932s', '1m26.601s', '1m19.998s'])
large-v3 + clipy-docker-TS + GPU took
30.127666666666666 seconds in average (['0m32.336s', '0m29.940s', '0m28.107s'])
large-v3 + clipy-11-docker-TS + GPU took
81.04266666666668 seconds in average (['1m18.848s', '1m24.591s', '1m19.689s'])
cpb-aacip-528-h12v40m45x : duration=00:01:47.76
large-v3 + whisper-TS + GPU took
28.659999999999997 seconds in average (['0m27.770s', '0m29.493s', '0m28.717s'])
large-v3 + clipy-docker-TS + GPU took
29.872666666666664 seconds in average (['0m28.712s', '0m30.191s', '0m30.715s'])
large-v3 + clipy-11-docker-TS + GPU took
26.674666666666667 seconds in average (['0m27.601s', '0m27.098s', '0m25.325s'])
Just by rough eyeballing, transcripts produced with the larger search size seem to be of better quality:
$ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-whisper-TS.out | head
[00:00.000 --> 00:00.200] you
[00:30.000 --> 00:33.640] Thank you.
[01:09.340 --> 01:10.300] Thank you.
[01:32.640 --> 01:34.640] This week on Warriors and Company.
[01:35.500 --> 01:52.360] Is this really our model for the Middle East that we are going to bomb countries continuously, take part in civil wars, sometimes supporting one side, maybe supporting the other, with no means or no real desire or effort to achieve a peace?
[01:52.600 --> 01:58.200] As much as President Obama wishes we weren't the world's policemen, perhaps we are.
[01:58.340 --> 01:59.980] And there's no escaping that.
[02:00.000 --> 02:00.560] That curse.
[02:01.280 --> 02:02.880] Funding is provided by
[02:02.880 --> 02:06.340] Ann Gumowitz, encouraging the renewal of democracy.
$ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-clipy-TS.out | grep '"text"' | head
"text": {
"text": "you",
"text": "you",
"text": "you",
"text": "you",
"text": "you",
"text": "this week on lawyers and company",
"text": "is this really",
"text": "our model for the middle east that we are going",
"text": "to bomb countries continuously",
$ cat speedtest-outputs/cpb-aacip-002d7648e70.mp4-attempt1-large-v3-gpu-clipy-11-TS.out | grep '"text"' | head
"text": {
"text": "you",
"text": "Thank you.",
"text": "Thank you.",
"text": "This week on Warriors and Company.",
"text": "Is this really our model for the Middle East that we are going to bomb countries continuously, take part in civil wars, sometimes supporting one side, maybe supporting the other, with no means or no real desire or effort to achieve a peace?",
"text": "As much as President Obama wishes we weren't the world's policemen, perhaps we are.",
"text": "And there's no escaping that.",
"text": "That curse.",
"text": "Funding is provided by",
Bug Description
Originally reported by @owencking via Slack