Open meonkeys opened 1 year ago
I got an image built. It's not clean enough for a pull request but I'll share what I've got anyway. Maybe someone else can pick this up and contribute it (assuming the maintainers want it).
I'm just creating a Dockerfile in a working copy (local clone) of this repository (HEAD at 2bdffc6b6e6e0d9ee8632dabf5009e995b31028d) and building with Docker. Here's the Dockerfile:
# FIXME: Makes a huge image.
# TODO: Optimize with a multi-stage build, perhaps also using venv.
# Pin to 3.10-bookworm to get Python 3.10
# because https://github.com/MahmoudAshraf97/whisper-diarization/issues/90
FROM python:3.10-bookworm
ARG WD_USER=joe
ARG WD_UID=1000
ARG WD_GROUP=joe
ARG WD_GID=1000
# We rarely see a full upgrade in a Dockerfile. Why?
# && apt-get --assume-yes dist-upgrade \
RUN apt-get update \
    && apt-get --assume-yes --no-install-recommends install \
        cython3 \
        ffmpeg \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /usr/src/app
COPY . .
RUN addgroup --gid $WD_GID $WD_GROUP \
    && adduser --uid $WD_UID --gid $WD_GID --shell /bin/bash --no-create-home $WD_USER \
    && chown -R $WD_USER:$WD_GROUP /usr/src/app
USER $WD_USER:$WD_GROUP
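# Create the virtualenv as the unprivileged user and install into it.
# --no-cache-dir on both pip calls keeps pip's download cache out of the image.
RUN mkdir venv \
    && python -m venv venv \
    && . venv/bin/activate \
    && pip install --no-cache-dir Cython \
    && pip install --no-cache-dir --requirement requirements.txt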
Build with docker build --tag whisper-diarization .
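If your host user is not joe with UID and GID 1000, the ARG values in the Dockerfile can be overridden at build time so that files written to the bind mounts end up owned by you. A minimal sketch, assuming a non-root host user whose name and IDs do not collide with accounts already present in the base image (wd_build is a hypothetical helper name, not part of the repository):

```shell
# Build the image so the container user mirrors whoever runs this function.
# WD_USER, WD_UID, WD_GROUP and WD_GID are the ARG names from the Dockerfile.
# Any arguments are used as a command prefix, so `wd_build echo` is a dry run.
wd_build() {
  "$@" docker build \
    --build-arg "WD_USER=$(id -un)" \
    --build-arg "WD_UID=$(id -u)" \
    --build-arg "WD_GROUP=$(id -gn)" \
    --build-arg "WD_GID=$(id -g)" \
    --tag whisper-diarization .
}
```

Call `wd_build` from the working copy to build, or `wd_build echo` to print the command first. With an image built this way, pass your own user and group to `--user` on docker run instead of joe:joe.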
The rest assumes a Bash shell on Linux or something close to / compatible with that.
As user joe with UID 1000 and GID 1000, run with, for example:
BASE=$HOME/whisper-diarization
mkdir -p "$BASE/data"
mkdir -p "$BASE/HOME_CACHE"
mkdir -p "$BASE/HOME_CONFIG"
APP=/usr/src/app
mv /tmp/recording.mp3 "$BASE/data/"
docker run --rm -it \
  -v "$BASE/data:/data" \
  -v "$BASE/HOME_CONFIG:$APP/.config" \
  -v "$BASE/HOME_CACHE:$APP/.cache" \
  --user joe:joe \
  whisper-diarization \
  bash
Now you're in the container at a non-root shell prompt, presumably. Run:
export HOME=/usr/src/app
source venv/bin/activate
python diarize_parallel.py -a /data/recording.mp3
exit
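The interactive steps above can also be collapsed into a single docker run. This is only a sketch under the same assumptions as the rest of this comment (image tag whisper-diarization, user joe, the three host directories under $BASE already created); wd_diarize is a hypothetical helper name. It still sources the venv rather than calling venv/bin/python directly, so that anything the script launches via PATH also picks up the venv.

```shell
# Run the diarization in one shot instead of from an interactive shell.
# $1 is the name of an audio file that already sits in $BASE/data.
# Remaining arguments are used as a command prefix, so `echo` gives a dry run.
wd_diarize() {
  local base=${BASE:-$HOME/whisper-diarization}
  local app=/usr/src/app
  local audio=$1
  shift
  "$@" docker run --rm \
    -v "$base/data:/data" \
    -v "$base/HOME_CONFIG:$app/.config" \
    -v "$base/HOME_CACHE:$app/.cache" \
    --env "HOME=$app" \
    --user joe:joe \
    whisper-diarization \
    bash -c 'source venv/bin/activate && python diarize_parallel.py -a "$1"' bash "/data/$audio"
}
```

For example, `wd_diarize recording.mp3` runs the job, and `wd_diarize recording.mp3 echo` prints the docker command without running it.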
Now, inspect and manually clean up $BASE/data/recording.txt on the host.
Don't forget the --gpus all for docker run (if you want to use your GPU).
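A quick way to see whether the GPU is actually visible inside the container, before starting a long job, might look like the sketch below. It assumes the NVIDIA Container Toolkit is set up on the host (docker run needs it for --gpus) and that PyTorch is among the packages pulled in by requirements.txt; wd_gpu_check is a hypothetical helper name.

```shell
# Ask PyTorch inside the container whether it can use a CUDA device.
# Any arguments are used as a command prefix, so `wd_gpu_check echo` is a dry run.
wd_gpu_check() {
  "$@" docker run --rm --gpus all \
    --user joe:joe \
    whisper-diarization \
    bash -c 'source venv/bin/activate && python -c "import torch; print(torch.cuda.is_available())"'
}
```

If this prints False, or docker rejects the --gpus flag, the host-side GPU setup is the first thing to check.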
Just released "transcription stream" on GitHub today, which includes a docker image that runs diarize.py. Takes me about 15 minutes to build, but works great and is fast/automated. Would love to get your thoughts: https://github.com/transcriptionstream/transcriptionstream
It took me 30 minutes to build, and the image comes to 7.5 GB, but it works. Thanks for sharing :)
Just thought it would be handy to have a Docker image for this tool. I've been unable to get it working so far but I'll keep trying. If anyone else has it running in Docker, please share.