audio.whisper

This repository contains an R package which is an Rcpp wrapper around the whisper.cpp C++ library.

The package allows to transcribe audio files using the "Whisper" Automatic Speech Recognition model
The package is based on an inference engine written in C++11, no external software is needed, so that you can directly install and use it from R

Available models

Model	Language	Size	RAM needed	Comment
`tiny` & `tiny.en`	Multilingual & English only	75 MB	390 MB	audio.whisper >=0.3 & 0.2.2
`base` & `base.en`	Multilingual & English only	142 MB	500 MB	audio.whisper >=0.3 & 0.2.2
`small` & `small.en`	Multilingual & English only	466 MB	1.0 GB	audio.whisper >=0.3 & 0.2.2
`medium` & `medium.en`	Multilingual & English only	1.5 GB	2.6 GB	audio.whisper >=0.3 & 0.2.2
`large-v1`	Multilingual	2.9 GB	4.7 GB	audio.whisper >=0.3 & 0.2.2
`large-v2`	Multilingual	2.9 GB	4.7 GB	audio.whisper >=0.3
`large-v3`	Multilingual	2.9 GB	4.7 GB	audio.whisper >=0.3

Installation

For the stable version of this package:

remotes::install_github("bnosac/audio.whisper", ref = "0.3.3") (uses whisper.cpp version 1.5.4)
remotes::install_github("bnosac/audio.whisper", ref = "0.2.2") (uses whisper.cpp version 1.2.1)

Look to the documentation of the functions: help(package = "audio.whisper")

For the development version of this package: remotes::install_github("bnosac/audio.whisper")
Once you gain familiarity with the flow, you can gain faster transcription speeds by reading this section.

Example

Load the model either by providing the full path to the model or specify the shorthand which will download the model

see the help of whisper_download_model for a list of available models and to download a model
you can always download the model manually at https://huggingface.co/ggerganov/whisper.cpp

library(audio.whisper)
model <- whisper("tiny")
model <- whisper("base")
model <- whisper("small")
model <- whisper("medium")
model <- whisper("large-v1")
model <- whisper("large-v2")
model <- whisper("large-v3")
path  <- system.file(package = "audio.whisper", "repo", "ggml-tiny.en-q5_1.bin")
model <- whisper(path)

If you have a GPU (e.g. Mac with Metal or Linux with CUDA and installed audio.whisper as indicated below), you can use it by specifying: model <- whisper("medium", use_gpu = TRUE), otherwise you will use your CPU.

Transcribe a .wav audio file

using predict(model, "path/to/audio/file.wav") and
provide a language which the audio file is in (e.g. en, nl, fr, de, es, zh, ru, jp)
the result contains the segments and the tokens

audio <- system.file(package = "audio.whisper", "samples", "jfk.wav")
trans <- predict(model, newdata = audio, language = "en", n_threads = 2)

trans
$n_segments
[1] 1

$data
 segment         from           to                                                                                                       text
       1 00:00:00.000 00:00:11.000  And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

$tokens
 segment      token token_prob
       1        And  0.7476438
       1         so  0.9042299
       1         my  0.6872202
       1     fellow  0.9984470
       1  Americans  0.9589157
       1        ask  0.2573057
       1        not  0.7678108
       1       what  0.6542882
       1       your  0.9386917
       1    country  0.9854987
       1        can  0.9813995
       1         do  0.9937403
       1        for  0.9791515
       1        you  0.9925495
       1        ask  0.3058807
       1       what  0.8303462
       1        you  0.9735528
       1        can  0.9711444
       1         do  0.9616748
       1        for  0.9778513
       1       your  0.9604713
       1    country  0.9923630
       1          .  0.4983074

Format of the audio

Note about that the audio file needs to be a 16000Hz 16-bit .wav file.

you can use R package av which provides bindings to ffmpeg to convert to that format
or alternatively, use ffmpeg as follows: ffmpeg -i input.wmv -ar 16000 -ac 1 -c:a pcm_s16le output.wav

library(av)
download.file(url = "https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3", 
              destfile = "rant1.mp3", mode = "wb")
av_audio_convert("rant1.mp3", output = "output.wav", format = "wav", sample_rate = 16000)

Transcription

```{r} trans <- predict(model, newdata = "output.wav", language = "en", duration = 30 * 1000, offset = 7 * 1000, token_timestamps = TRUE) trans $n_segments [1] 11 $data segment from to text 1 00:00:07.000 00:00:09.000 Look at the picture. 2 00:00:09.000 00:00:11.000 See the skull. 3 00:00:11.000 00:00:13.000 The part of bone removed. 4 00:00:13.000 00:00:16.000 The master race Frankenstein radio controls. 5 00:00:16.000 00:00:18.000 The brain thoughts broadcasting radio. 6 00:00:18.000 00:00:21.000 The eyesight television. The Frankenstein earphone radio. 7 00:00:21.000 00:00:25.000 The threshold brain wash radio. The latest new skull reforming. 8 00:00:25.000 00:00:28.000 To contain all Frankenstein controls. 9 00:00:28.000 00:00:31.000 Even in thin skulls of white pedigree males. 10 00:00:31.000 00:00:34.000 Visible Frankenstein controls. 11 00:00:34.000 00:00:37.000 The synthetic nerve radio, directional and an alloop. $tokens segment token token_prob token_from token_to 1 Look 0.4281234 00:00:07.290 00:00:07.420 1 at 0.9485379 00:00:07.420 00:00:07.620 1 the 0.9758387 00:00:07.620 00:00:07.940 1 picture 0.9734664 00:00:08.150 00:00:08.580 1 . 0.9688568 00:00:08.680 00:00:08.910 2 See 0.9847929 00:00:09.000 00:00:09.420 2 the 0.7588121 00:00:09.420 00:00:09.840 2 skull 0.9989663 00:00:09.840 00:00:10.310 2 . 0.9548351 00:00:10.550 00:00:11.000 3 The 0.9914295 00:00:11.000 00:00:11.170 3 part 0.9789217 00:00:11.560 00:00:11.600 3 of 0.9958754 00:00:11.600 00:00:11.770 3 bone 0.9759618 00:00:11.770 00:00:12.030 3 removed 0.9956936 00:00:12.190 00:00:12.710 3 . 0.9965582 00:00:12.710 00:00:12.940 4 The 0.9923794 00:00:13.000 00:00:13.210 4 master 0.9875370 00:00:13.350 00:00:13.640 4 race 0.9803119 00:00:13.640 00:00:13.930 4 Franken 0.9982004 00:00:13.930 00:00:14.440 4 stein 0.9998384 00:00:14.440 00:00:14.800 4 radio 0.9780943 00:00:14.800 00:00:15.160 4 controls 0.9893969 00:00:15.160 00:00:15.700 4 . 0.9796444 00:00:15.750 00:00:16.000 5 The 0.9870584 00:00:16.000 00:00:16.140 5 brain 0.9964160 00:00:16.330 00:00:16.430 5 thoughts 0.9657190 00:00:16.490 00:00:16.870 5 broadcasting 0.9860524 00:00:16.870 00:00:17.530 5 radio 0.9439469 00:00:17.530 00:00:17.800 5 . 0.9973570 00:00:17.800 00:00:17.960 6 The 0.9774312 00:00:18.000 00:00:18.210 6 eyesight 0.9293824 00:00:18.250 00:00:18.910 6 television 0.9896797 00:00:18.910 00:00:19.690 6 . 0.9961249 00:00:19.810 00:00:20.000 6 The 0.5245560 00:00:20.000 00:00:20.090 6 Franken 0.9829712 00:00:20.090 00:00:20.300 6 stein 0.9999006 00:00:20.320 00:00:20.470 6 ear 0.9958365 00:00:20.470 00:00:20.560 6 phone 0.9876402 00:00:20.560 00:00:20.720 6 radio 0.9854031 00:00:20.720 00:00:20.860 6 . 0.9930948 00:00:20.950 00:00:21.000 7 The 0.9887797 00:00:21.000 00:00:21.200 7 threshold 0.9979410 00:00:21.200 00:00:21.750 7 brain 0.9938735 00:00:21.880 00:00:22.160 7 wash 0.9781434 00:00:22.160 00:00:22.430 7 radio 0.9931799 00:00:22.430 00:00:22.770 7 . 0.9941305 00:00:22.770 00:00:23.000 7 The 0.5658014 00:00:23.000 00:00:23.230 7 latest 0.9985833 00:00:23.230 00:00:23.690 7 new 0.9956740 00:00:23.690 00:00:23.920 7 skull 0.9990881 00:00:23.920 00:00:24.300 7 reform 0.9664753 00:00:24.300 00:00:24.760 7 ing 0.9966548 00:00:24.760 00:00:24.870 7 . 0.9644036 00:00:25.000 00:00:25.000 8 To 0.9600158 00:00:25.010 00:00:25.170 8 contain 0.9938834 00:00:25.170 00:00:25.770 8 all 0.9625537 00:00:25.770 00:00:26.020 8 Franken 0.9710320 00:00:26.020 00:00:26.620 8 stein 0.9998924 00:00:26.620 00:00:27.040 8 controls 0.9955972 00:00:27.040 00:00:27.720 8 . 0.9759502 00:00:27.720 00:00:28.000 9 Even 0.9824280 00:00:28.000 00:00:28.300 9 in 0.9928908 00:00:28.300 00:00:28.450 9 thin 0.9970337 00:00:28.450 00:00:28.750 9 skull 0.9954430 00:00:28.750 00:00:29.120 9 s 0.9987136 00:00:29.120 00:00:29.180 9 of 0.9772032 00:00:29.280 00:00:29.350 9 white 0.9897125 00:00:29.350 00:00:29.720 9 ped 0.9980962 00:00:29.810 00:00:29.960 9 ig 0.9971448 00:00:29.960 00:00:30.100 9 ree 0.9996273 00:00:30.100 00:00:30.320 9 males 0.9934869 00:00:30.390 00:00:30.700 9 . 0.9789821 00:00:30.780 00:00:30.990 10 Vis 0.8950536 00:00:31.050 00:00:31.250 10 ible 0.9988410 00:00:31.290 00:00:31.690 10 Franken 0.9976653 00:00:31.690 00:00:32.360 10 stein 0.9999056 00:00:32.430 00:00:32.880 10 controls 0.9977503 00:00:32.880 00:00:33.670 10 . 0.9917345 00:00:33.680 00:00:34.000 11 The 0.9685771 00:00:34.000 00:00:34.180 11 synthetic 0.9910653 00:00:34.180 00:00:34.730 11 nerve 0.9979016 00:00:34.730 00:00:35.030 11 radio 0.9594643 00:00:35.030 00:00:35.330 11 , 0.8811045 00:00:35.330 00:00:35.450 11 directional 0.9930993 00:00:35.450 00:00:36.120 11 and 0.8905478 00:00:36.120 00:00:36.300 11 an 0.9520693 00:00:36.300 00:00:36.420 11 all 0.7639735 00:00:36.420 00:00:36.600 11 oop 0.9988559 00:00:36.600 00:00:36.730 11 . 0.9924630 00:00:36.830 00:00:37.000 ```

Notes on silences

If you want remove silences from your audio files. You could use R packages

Speed of transcribing

The tensor operations contained in ggml.h / ggml.c are highly optimised depending on the hardware of your CPU

It has AVX intrinsics support for x86 architectures, VSX intrinsics support for POWER architectures, Mixed F16 / F32 precision, for Apple silicon allows optimisation via Arm Neon, the Accelerate framework and Metal and provides GPU support for NVIDIA
In order to gain from these massive transcription speedups, you need to set the correct compilation flags when you install the R package, otherwise transcription speed will be suboptimal (a 5-minute audio fragment can either be transcribed in 40 minutes or 10 seconds depending on your hardware).
Normally using the installation as described above, some of these compilation flags are detected and you'll see these printed when doing the installation
It is however advised to set these compilation C flags yourself as follows right before you install the package such that /src/Makevars knows you want these optimisations for sure. This can be done by defining the environment variables WHISPER_CFLAGS, WHISPER_CPPFLAGS, WHISPER_LIBS as follows.

Sys.setenv(WHISPER_CFLAGS = "-mavx -mavx2 -mfma -mf16c")
remotes::install_github("bnosac/audio.whisper", ref = "0.3.3", force = TRUE)
Sys.unsetenv("WHISPER_CFLAGS")

To find out which hardware acceleration options your hardware supports, you can go to https://github.com/bnosac/audio.whisper/issues/26 and look for the CFLAGS (and optionally CXXFLAGS and LDFLAGS) settings which make sense on your hardware

Common settings to set for WHISPER_CFLAGS are -mavx -mavx2 -mfma -mf16c and extra possible flags -msse3 and mssse3
- E.g. on my local Windows Intel machine I could set -mavx -mavx2 -mfma -mf16c
- For Mac users you can speed up transcriptions by setting before installation of audio.whisper
  - Sys.setenv(WHISPER_ACCELERATE = "1") if your computer has the Accelerate framework
  - Sys.setenv(WHISPER_METAL = "1") if your computer has a GPU based on Metal
- For Linux users which have a NVIDIA GPU, processing can be offloaded to the GPU to a large extent through cuBLAS. For this speedup, install the R package with following settings
  - Sys.setenv(WHISPER_CUBLAS = "1")
  - make sure nvcc is in the PATH (e.g. export PATH=/usr/local/cuda-12.3/bin${PATH:+:${PATH}}) and set the path to CUDA if it is not at /usr/local/cuda as in Sys.setenv(CUDA_PATH = "/usr/local/cuda-12.3")
- On my older local Ubuntu machine there were no optimisation possibilities. Your mileage may vary.
- If you have OpenBLAS installed, you can considerably speed up transcription by installing the R package with Sys.setenv(WHISPER_OPENBLAS = "1")
If you need extra settings in PKG_CPPFLAGS (CXXFLAGS), you can e.g. use Sys.setenv(WHISPER_CPPFLAGS = "-mcpu=native") before installing the package
If you need extra settings in PKG_LIBS, you can e.g. use Sys.setenv(WHISPER_LIBS = "-framework Accelerate") before installing the package
If you need custom settings, you can update PKG_CFLAGS / PKG_CPPFLAGS / PKG_LIBS in /src/Makevars directly.

Note that if your hardware does not support these compilation flags, you'll get a crash when transcribing audio.

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

bnosac / audio.whisper

readme