🎙️🤖 Create, customize, and talk to your AI character/companion in realtime (all in one codebase!). Have a natural, seamless conversation with AI everywhere (mobile, web, and terminal) using LLMs (OpenAI GPT-3.5/4, Anthropic Claude 2), Chroma Vector DB, Whisper speech-to-text, and ElevenLabs text-to-speech 🎙️🤖
Added whisperX, reducing speech-to-text latency from ~0.5 s to ~0.13 s.
Added the diarization feature enabled by whisperX, so transcriptions can now include speaker IDs.
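A diarized transcription is typically a list of segments annotated with speaker labels. As a rough sketch (the field names `speaker` and `text` here are assumptions in the spirit of whisperX's aligned output, not this project's actual schema), grouping such segments by speaker might look like:

```python
from collections import defaultdict

def group_by_speaker(segments):
    """Group transcription segments by their speaker label.

    `segments` is assumed to be a list of dicts with "speaker" and
    "text" keys, similar in spirit to whisperX's diarized output.
    """
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"])
    return dict(grouped)

# Hypothetical output of a diarized transcription:
segments = [
    {"speaker": "SPEAKER_00", "text": "Hi there!"},
    {"speaker": "SPEAKER_01", "text": "Hello."},
    {"speaker": "SPEAKER_00", "text": "How are you?"},
]
print(group_by_speaker(segments))
# → {'SPEAKER_00': ['Hi there!', 'How are you?'], 'SPEAKER_01': ['Hello.']}
```

Segments missing a label fall back to `"UNKNOWN"`, which keeps the grouping total even when diarization fails on a short utterance.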
Switched from pydub to torchaudio for loading audio streams, cutting transcode time from ~95 ms to ~9 ms. Combined, the transcription process takes about 0.2 s with whisperX, down from ~0.6 s with faster-whisper.
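The speed-up comes from decoding the audio stream directly into arrays in-process, rather than round-tripping through pydub's file-oriented pipeline. A minimal stdlib-only sketch of the idea, using Python's `wave` module as a stand-in for `torchaudio.load` (which does the same job much faster, into tensors):

```python
import io
import struct
import wave

def load_pcm(wav_bytes):
    """Decode a mono 16-bit WAV byte stream straight into a list of
    float samples in [-1, 1], with no temp files or transcoding step."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV and decode it directly from the stream.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(16000)  # 16 kHz, the usual rate for Whisper input
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 0))
samples = load_pcm(buf.getvalue())
print(samples)  # → [0.0, 0.5, -0.5, 0.0]
```

Everything stays in memory from the network buffer to the model input, which is where the 95 ms → 9 ms win comes from.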
Added a warm-up run before the first sentence, avoiding the ~2-4 s overhead of the first round (24/7 servers don't care about the first round anyway, since they only pay it once at startup).
To Do
Support pre-tokens and suppress tokens.
Test in non-web environments.
Think about the diarization API, including persisting speaker identities throughout a conversation.
Test and optimize VRAM usage, especially during diarization.