Whisper API

A simple API to access whisper for speech to text transcription.

It simplifies offloading the heavy lifting of using Whisper to a central GPU server, which can be accessed by multiple people.

Features

Transcribes audio files to text using OpenAI Whisper

Simple frontend

Includes a simple static frontend to transcribe audio files (/)
Includes a interactive API documentation using the Swagger UI (/docs)
Implements a task queue to handle multiple requests (first in, first out)

Resource friendly GPU support

Uses GPU acceleration if available
Supports loading the model into VRAM on startup OR on first request
Supports unloading the model after a certain time of inactivity

Privacy focussed

Stateless: to prioritize data privacy, the API only stores data in RAM. Audio files are stored using tempfile and are deleted after processing.
Logs don't contain any transcribed text and transcription ids are obfuscated
Results are deleted from RAM after a given time

Setup recommendations

This service performs the best, when it is run on a server with a GPU. For using the high-quality models, I recommend using a GPU with at least 12GB of VRAM. The RTX 3060 12GB is most likely the cheapest option for this task.

This service is optimized for a multi user environment. I will discuss 2 setups:

Personal setup

When you are the only user of this service, you can run it on your local network. This way you can access the service from any device in your network. Use a VPN to access the service from outside your network.

SMB & research setup

When hosting this service in a more professional environment, we should consider the following:

should the service be accessible from outside the network?
who should be able to access the service?

If only users on your local network should be able to access the service and everyone in your network should be able to access it, you can run the service on a server in your network without any further configuration.

If you need to implement access control, I suggest the following:

use a reverse proxy to terminate SSL
use oauth2 to only allow users which belong to a certain group to access the service

My setup uses the following software:

NGINX as a reverse proxy
Keycloak as an identity provider
oauth2_proxy to handle oauth2 authentication and session tokens

In case you have some questions about the setup or software, feel free to reach out!

How to deploy

Linux - docker

Pre-requisites:

have Docker installed
install NVIDIA CUDA (if you want to use GPU acceleration)
install NVidia Container Toolkit (if you want to use GPU acceleration)

Create the following compose.yaml file:

services:
  whisperAPI:
    image: ghcr.io/mayniklas/whisper_api:latest
    ports:
      - "3001:3001"
    environment:
      - PORT=3001
      - LOAD_MODEL_ON_STARTUP=1
      # - UNLOAD_MODEL_AFTER_S=300
      # - DEVELOP_MODE=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

When nop using GPU acceleration, remove the deploy section from the compose.yaml file.

Run the following commands:

docker compose up -d

You can also use docker directly:

docker run -d -p 3001:3001 --gpus all ghcr.io/mayniklas/whisper_api:latest

Linux

Pre-requisites:

install NVIDIA CUDA
install ffmpeg (e.g. sudo apt install ffmpeg)

Since project is a well packaged python project, you don't have to worry about any project specific installation steps.

Create a virtual environment
Install this project in the virtual environment
Create a systemd service that runs the server

NixOS

Since I'm personally using NixOS, I created a module that is available through this flake.nix.

Add the following input to your flake.nix:

{
  inputs = {
    whisper_api.url = "github:MayNiklas/whisper_api";
  };
}

Import the module in your configuration.nix and use it:

{ pkgs, config, lib, whisper_api, ... }: {

  imports = [ whisper_api.nixosModules.whisper_api ];

  services.whisper_api = {
    enable = true;
    withCUDA = true;
    loadModelOnStartup = true;
    # unloadModelAfterSeconds = 300;
    listen = "0.0.0.0";
    openFirewall = true;
    environment = { };
  };

}

Development

Linux

Pre-requisites:

install NVIDIA CUDA
install ffmpeg (e.g. sudo apt install ffmpeg)

# clone the repository
git clone https://github.com/MayNiklas/whisper_api.git

# change into the directory
cd whisper_api

# create a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# prepare the environment
pip3 install -e '.[dev]'

# run the server from within the virtual environment
cd src/
uvicorn whisper_api:app --reload --host 127.0.0.1 --port 3001

# alternatively, you can use the following command to run the server
export PORT=3001
export LISTEN=127.0.0.1
whisper_api

NixOS

# clone the repository
git clone https://github.com/MayNiklas/whisper_api.git

# change into the directory
cd whisper_api

# run the server via nix (using CUDA)
nix run .#whisper_api_withCUDA

# enter the development shell providing the necessary environment
nix develop .#withCUDA

Settings

parameter	description	possible values	default
`PORT`	Port the API is available under	any number of port interval	3001
`LISTEN`	Address the API is available under	any IP or domain you own	127.0.0.1
`LOAD_MODEL_ON_STARTUP`	If model shall be loaded on startup	`1` (yes) or `0` (no)	1
`DEVELOP_MODE`	Develop mode defaults to smallest model to save time	`1` (yes) or `0` (no)	0
`UNLOAD_MODEL_AFTER_S`	If set the model gets unloaded after inactivity of t seconds, unset means no unload	any int (0 for instant unload)	'unset'
`DELETE_RESULTS_AFTER_M`	Time after which results are deleted from internal storage	any int	60
`REFRESH_EXPIRATION_TIME_ON_USAGE`	If result is used expand lifetime	`1` (yes) or `0` (no)	1
`RUN_RESULT_EXPIRY_CHECK_M`	Interval in which timeout checks shall be executed	any int (0 enables lazy timeout)	5
`USE_GPU_IF_AVAILABLE`	If GPU shall be used when available	`1` (yes) or `0` (no)	1
`MAX_MODEL`	Max model to be used for decoding, unset means best possible	name of official model	'unset'
`MAX_TASK_QUEUE_SIZE`	The limit of tasks that can be queued in the decoder at the same time before rejection	any int	128
`CPU_FALLBACK_MODEL`	The fallback when `MAX_MODEL` is not set and CPU mode is needed	name of official model	medium
`LOG_DIR`	The directory to store log-file(s) in "" means 'this directory', dir is created if needed	wanted directory name or empty str	"data/"
`LOG_FILE`	The name of the log file	arbitrary filename	whisper_api.log
`LOG_LEVEL_CONSOLE`	The name of the log file	arbitrary filename	whisper_api.log
`LOG_PRIVACY_MODE`	Don't display full task uuids and other sensitive data in the logs	`1` (yes) or `0` (no)	1
`LOG_LEVEL_FILE`	Level of logging for the file	"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")	"INFO"
`LOG_FORMAT`	Format of the log messages	any valid log message format	see below
`LOG_DATE_FORMAT`	Format of the date in log messages	any valid date format	"%d.%m. %H:%M:%S"
`LOG_ROTATION_WHEN`	Specifies when log rotation should occur	"S", "M", "H", "D", "W0"-"W6", "midnight"	"H"
`LOG_ROTATION_INTERVAL`	Interval at which log rotation should occur	any int	2
`LOG_ROTATION_BACKUP_COUNT`	Number of backup log files to keep	any int	48
`AUTHORIZED_MAILS`	Mail-addresses which are authorized to access special routes (whitespace separated)	any int	48

The log format is: "[{asctime}] [{levelname}][{processName}][{threadName}][{module}.{funcName}] {message}", using { as format specifier. All logging parameters follow pythons logging and the RotatingFileHandler specification.

LOG_AUTHORIZED_MAILS

The API provides a /logs route. That route provides all logs for download. The verification is done based on the 'X-Email' field in the request headers. A valid input would be: LOG_AUTHORIZED_MAILS="nik@example.com chris@example.com". Requests from localhost are currently always permitted (want an env-option to disable it? - make an issue).

Other privileged routes may come in the future.

Note

The system will automatically try to use the GPU and the best possible model when USE_GPU_IF_AVAILABLE and MAX_MODEL are not set.

CPU Mode

MAX_MODEL must be set when CUDA is not available or explicitly disabled via USE_GPU_IF_AVAILABLE. CPU_FALLBACK_MODEL is the fallback when GPU Mode shall use max-model but CPU shall be limited due to reduced performance.

Warning

If UNLOAD_MODEL_AFTER_S is set to 0 the model will not only be unloaded nearly instantly, it internally also results in busy waiting! All ints are assumed to be unsigned.

# enable development mode -> use small models
export DEVELOP_MODE=1

MayNiklas / whisper_api

readme

Whisper API

Features

Simple frontend

Resource friendly GPU support

Privacy focussed

Setup recommendations

Personal setup

SMB & research setup

How to deploy

Linux - docker

Linux

NixOS

Development

Linux

NixOS

Settings

LOG_AUTHORIZED_MAILS

Note

CPU Mode

Warning

Projects being used