beeldengeluid / dane-asr-worker

DANE worker for processing ASR (optimised for Dutch)
GNU General Public License v3.0
dane-asr-worker

Important note: DANE has currently been taken out of the code. To run this worker with the old DANE code intact, use the following image: ghcr.io/beeldengeluid/dane-asr-worker:sha-f9197d8

Once we are sure DANE can be taken out completely, we will also rename this repository and remove all references to DANE. For now, we still call it the dane-asr-worker.

Development status

Configuration

The .env.override file configures the worker and passes the input and output variables. Create it by copying .env and changing the values. Notes on what each value means are in the .env file.
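As a rough illustration of the override idea, the sketch below reads KEY=VALUE pairs from .env and then lets .env.override replace them. How the worker itself loads these files is not described here, so the helper names (`parse_env_text`, `load_config`) and the parsing rules are assumptions made for illustration only.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines, # comments and lines without '=' are skipped."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_config(base: str = ".env", override: str = ".env.override") -> dict[str, str]:
    """Hypothetical loader: values in the override file win over the base file."""
    config: dict[str, str] = {}
    for name in (base, override):
        path = Path(name)
        if path.is_file():
            config.update(parse_env_text(path.read_text(encoding="utf-8")))
    return config
```

Reading the base file first and the override second means a key present in both ends up with the value from .env.override, while keys only present in .env keep their defaults.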

Besides the environment, make sure the following two directories are available:

Data dir

The data dir is structured as follows:

Note that the last part will probably change later on. Most likely, data/output/{FILE_ID}/1Best.ctm|1Best.txt|transcript.json will be uploaded uncompressed.

Note: you can download sample data here.

Model dir

Kaldi_NL checks the models dir on startup to see whether the (UTwente + Radboud) models have already been downloaded. If not, it downloads them.

Note that Kaldi_NL will also try to create symlinks in the models dir, which fails (most definitely in OpenShift) if the process does not have the right permissions. For this reason, the docker-compose files in this repo are set to run as root.
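To find out up front whether the current user may create symlinks in the models dir, a small probe like the one below can help. It is not part of this repo or of Kaldi_NL; the function name `can_create_symlinks` is made up for this sketch, and it only tests symlink creation in a throwaway subdirectory.

```python
import os
import tempfile


def can_create_symlinks(directory: str) -> bool:
    """Return True if the current process can create a symlink inside directory."""
    try:
        # Work in a temporary subdirectory so nothing is left behind.
        with tempfile.TemporaryDirectory(dir=directory) as tmp:
            target = os.path.join(tmp, "target")
            link = os.path.join(tmp, "link")
            with open(target, "w", encoding="utf-8") as handle:
                handle.write("probe")
            os.symlink(target, link)
            return os.path.islink(link)
    except OSError:
        # Covers a missing directory, missing write permission and
        # file systems or platforms that refuse symlinks.
        return False
```

Calling `can_create_symlinks("./models")` (the path is an example) before starting the worker gives an early hint that the container user needs more permissions.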

Also note that the Kaldi_NL model download will run on an average laptop, but the speech recognition process will not work with less than 16 GB of RAM.

Docker

Build the image with:

docker build -t dane-asr-worker -f dane-worker.Dockerfile .

Run the worker with:

docker compose -f docker-compose-dane-worker.yml up

Python (run with sample data)

Install the Python virtual env with all required packages:

poetry install

Enter the virtual env:

poetry shell

Test the worker code:

./scripts/check-project.sh

Run the worker:

./scripts/run.sh

CLI arguments:

Run with sample data

You can download sample data here. Make sure to put it in the data directory within your local copy of this repo.

Make sure to configure your .env.override with:

AUDIO_SAMPLE_URL=http://fake-hosting.beng.nl/2101608150135908031__NOS_JOURNAAL_-WON01207359.mp4

Since ./data/2101608150135908031__NOS_JOURNAAL_-WON01207359.mp4 already exists, you can verify that the worker skips downloading the data from that --input-uri.

The worker should also see that the Kaldi_NL output already exists and skip calling Kaldi_NL as well (see the run function in simple_asr.py to follow the worker's current processing logic).
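The skip behaviour described above can be sketched roughly as follows. This is not the code from simple_asr.py: the function names, the `download` and `run_asr` callables, and the rule that a non-empty output directory counts as "output already exists" are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


def local_name(input_uri: str) -> str:
    """File name taken from the last path segment of the URI."""
    return Path(urlparse(input_uri).path).name


def process(
    input_uri: str,
    data_dir: Path,
    output_dir: Path,
    download: Callable[[str, Path], None],
    run_asr: Callable[[Path, Path], None],
) -> list[str]:
    """Run only the steps whose results are missing; return the steps executed."""
    steps: list[str] = []
    input_file = data_dir / local_name(input_uri)
    # Download only when the input file is not on disk yet.
    if not input_file.exists():
        download(input_uri, input_file)
        steps.append("download")
    # Assumption: a non-empty output directory means ASR output already exists.
    if not (output_dir.is_dir() and any(output_dir.iterdir())):
        run_asr(input_file, output_dir)
        steps.append("asr")
    return steps
```

With the sample file and its Kaldi_NL output already in place, such a function would return an empty list, which matches the expectation that both the download and the ASR step are skipped.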