Wordcab / wordcab-transcribe

💬 ASR FastAPI server using faster-whisper and Multi-Scale Auto-Tuning Spectral Clustering for diarization.
https://wordcab.github.io/wordcab-transcribe/
MIT License

Issue building DockerFile #315

Open hobodrifterdavid opened 3 months ago

hobodrifterdavid commented 3 months ago

Hello. This project looks very interesting. I hit some issues building the Dockerfile as described in the readme:

docker build -t wordcab-transcribe:latest .

docker run -d --name wordcab-transcribe \
    --gpus all \
    --shm-size 1g \
    --restart unless-stopped \
    -p 5001:5001 \
    -v ~/.cache:/root/.cache \
    wordcab-transcribe:latest

On the first machine (Ubuntu Server 22 LTS, 4x 3090), the build completed, but I got an 'illegal memory access' error on startup, I think from a CUDA library. This machine previously had a modified NVIDIA driver for P2P access, so it's possible this isn't your issue. (https://github.com/tinygrad/open-gpu-kernel-modules/issues/4)

On the second machine (Ubuntu Server 22 LTS, 1x 3090), I initially got an error about the pinned openssl version being unavailable or incompatible; after I removed the version number specified in the Dockerfile, the build continued. But the latest error is "ModuleNotFoundError: No module named 'IPython'"

Just a heads up for now; ideally I'd be able to help you debug.

(screenshot attached)

aleksandr-smechov commented 3 months ago

@hobodrifterdavid Thanks for bringing up the issue. The documentation is a bit outdated. Can you please try the latest main branch and this Docker command instead:

docker run --name wordcab-transcribe \
    --gpus all \
    --shm-size 1g \
    --restart unless-stopped \
    -p 5001:5001 \
    -e WORDCAB_TRANSCRIBE_API_KEY="x" \
    -e WHISPER_MODEL="medium" \
    -e WHISPER_ENGINE="faster-whisper-batch" \
    -e ALIGN_MODEL="tiny" \
    -e DIARIZATION_BACKED="longform-diarizer" \
    -e COMPUTE_TYPE="float16" \
    -e DEBUG="True" \
    -e USERNAME="admin" \
    -e PASSWORD="password" \
    -e OPENSSL_KEY="0123456789abcdefghijklmnopqrstuvwyz" \
    -e WINDOW_LENGTHS="2.0,1.5,1.0,0.75,0.5" \
    -e SHIFT_LENGTHS="1.0,0.75,0.625,0.5,0.25" \
    -e TENSORRT_LLM_VERSION="0.9.0.dev2024032600" \
    wordcab-transcribe

The environment variables are from the .env file, feel free to customize.
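Once it's running, a quick sanity check is to hit the interactive FastAPI docs. A minimal sketch with requests, assuming the port mapping above:

# Liveness check; FastAPI serves its interactive docs at /docs by default,
# and the docker command above maps the server to localhost:5001.
import requests

resp = requests.get("http://localhost:5001/docs", timeout=10)
print(resp.status_code)  # 200 means the server is up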

hobodrifterdavid commented 3 months ago

On the second machine, I'm able to build if I add ipython to requirements.txt. The 'docker run' command in the readme does start the container successfully, and I'm able to process a request, but it errors out if I try to use the VAD. It seems okay with the updated command you sent. On the first machine I still get the illegal memory access, but I'll wipe the machine and try again.

(screenshot attached)

I've got a few questions. :)

Is there a preferred backend for processing a long file across multiple GPUs?

According to your docs, TensorRT-LLM doesn't allow passing a prompt. The prompt is useful for nudging the model toward outputting zh-CN or zh-TW, since Whisper only has a single supported Chinese language code. Although, I guess machine translation as a post-processing step might be a reasonable way to handle this.
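For reference, this is the trick I mean, using faster-whisper directly rather than this API (a minimal sketch; the model size and file path are placeholders):

# Whisper only has a single "zh" language code, so seeding the decoder with
# a Traditional Chinese initial_prompt nudges it toward zh-TW-style output.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",                        # placeholder path
    language="zh",
    initial_prompt="以下是繁體中文的字幕。",  # Traditional Chinese seed text
)
for segment in segments:
    print(segment.start, segment.end, segment.text)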

Faster-Whisper has a length_penalty parameter that, as I understand it, increases the probability of the 'end of segment' token the longer the segment gets. I think it's useful for pushing the output toward shorter segments/subs. Could it be exposed in the API? The current output often gives segments that are too long to show as subtitles. By the way, I noticed today that stable-ts has a set of functions for splitting and merging subs, although a proper sentence segmenter would also be helpful.
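For illustration, this is the parameter as faster-whisper itself exposes it (a sketch; the value is just an example, and my reading of its effect may be off):

# length_penalty rescales beam-search scores by hypothesis length, so a
# value below the default 1.0 should bias decoding toward shorter
# hypotheses, and hence shorter segments.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, _ = model.transcribe(
    "audio.wav",         # placeholder path
    beam_size=5,
    length_penalty=0.8,  # example value; faster-whisper's default is 1.0
)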

aleksandr-smechov commented 3 months ago

@hobodrifterdavid I noticed the missing IPython as well; check out the latest main branch that I just pushed, it should resolve a few issues.

I actually prefer the Whisper engine I just added now, faster-whisper-batched, which pulls in a bunch of unmerged PRs from the faster-whisper library that make things go much faster.

Use the edited docker run command above and head to the FastAPI docs, where the first audio file endpoint should have the length_penalty parameter. I recommend setting batch_size to at least 4 or 8, and num_beams to 5, given your GPU.

The FastAPI docs are a bit awkward with list inputs, so if you want to add vocab you'll need to use curl or requests with the audio file endpoint (see the sketch below). You can also use the audio-url endpoint and pass vocab and the other parameters in the JSON, but you'll need a presigned URL to test that.
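Something like this with requests, as a rough sketch; the route and form field names are placeholders, so confirm them against the interactive docs at /docs for your build:

# Posting an audio file with a custom vocab list; requests sends a Python
# list as repeated form fields, which is what FastAPI expects for list input.
import requests

url = "http://localhost:5001/api/v1/audio"  # placeholder route; check /docs

with open("audio.wav", "rb") as f:          # placeholder file
    resp = requests.post(
        url,
        files={"file": f},
        data={
            "vocab": ["Wordcab", "NeMo"],   # example vocab entries
            "batch_size": 8,
            "num_beams": 5,
        },
        timeout=600,
    )

resp.raise_for_status()
print(resp.json())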

hobodrifterdavid commented 3 months ago

I wiped the first machine and it runs fine now. I don't see the length_penalty param in the docs yet, though.

Is it Silero VAD that's used? Do you know how it compares to other VADs (NeMo, etc.) across different languages?

hobodrifterdavid commented 2 months ago

I think you might not have pushed the length_penalty change. 👀🙂