
# talk-llama-fast

A port of OpenAI's Whisper model in C/C++ with XTTSv2 and wav2lip, based on talk-llama from https://github.com/ggerganov/whisper.cpp

Video guide in Russian (English subs): https://youtu.be/0MEZ84uH4-E

English demo video, v0.1.3: https://www.youtube.com/watch?v=ORDfSG4ltD4

Demo video in Russian, v0.1.0: https://youtu.be/ciyEsZpzbM8

Telegram: https://t.me/tensorbanana

I added:

I used:

## News

## Notes

## Languages

Whisper STT supported languages: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

XTTSv2 supported languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).

Mistral officially supports English, French, Italian, German, and Spanish. It can also speak some other languages, though not as fluently (e.g. Russian is not officially supported, but it is there).

## Requirements

## Installation

For Windows 10/11 x64 with CUDA.

Install Miniconda. During installation, make sure to check "Add Miniconda3 to my PATH environment variable" - it's important.

Open the \xtts\ folder where you extracted talk-llama-fast-v0.1.3.zip. In this folder, open a cmd window and run the following line by line:

```
conda create -n xtts
conda activate xtts
conda install python=3.11
conda install git

pip install git+https://github.com/Mozer/xtts-api-server pydub
pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/Mozer/tts
conda deactivate
```

```
git clone https://github.com/Mozer/SillyTavern-Extras
cd SillyTavern-extras
pip install -r requirements.txt
cd modules
git clone https://github.com/Mozer/wav2lip
cd wav2lip
pip install -r requirements.txt
conda deactivate
```


- Notice that \wav2lip\ was installed inside the \SillyTavern-extras\modules\ folder. That's important.
- Edit xtts_wav2lip.bat and change `--output` from c:\DATA\LLM\SillyTavern-Extras\tts_out\ to the actual path where your \SillyTavern-Extras\tts_out\ dir is located. Don't forget the trailing slashes here (see the sketch after this list).
- Optional: if you have just 6 or 8 GB of VRAM, find `-ngl` in talk-llama-wav2lip.bat and change it to `-ngl 0`. It will move Mistral from GPU to CPU+RAM.
- Optional: edit talk-llama-wav2lip.bat or talk-llama-wav2lip-ru.bat and make sure it uses the correct names of the LLM and whisper models you downloaded. (A full description of the params is below.)
- Download [ffmpeg full](https://www.gyan.dev/ffmpeg/builds/) and add it to your PATH environment variable (how-to: https://phoenixnap.com/kb/ffmpeg-windows). Then download the h264 codec .dll of the required version from https://github.com/cisco/openh264/releases and put it into the /system32 or /ffmpeg/bin dir. In my case, on Windows 11, it was openh264-1.8.0-win64.dll. Wav2lip will work without this dll but will print an error.
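
The exact contents of xtts_wav2lip.bat are not reproduced here; as a rough sketch (assuming the .bat launches the server via `python -m xtts_api_server`, as in the upstream xtts-api-server project), the line you are editing might look like this:

```
rem Hypothetical sketch, not the shipped file - only the --output path needs to change.
rem Point it at your own \SillyTavern-Extras\tts_out\ folder and keep the trailing slash.
python -m xtts_api_server --output C:\your\path\SillyTavern-Extras\tts_out\
```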

## Running
- In /SillyTavern-extras/, double click `silly_extras.bat`. Wait until it downloads the wav2lip checkpoint and runs face detection for new videos if needed.
- In /xtts/, double click `xtts_wav2lip.bat` to start the xtts server with wav2lip video, OR run xtts_streaming_audio.bat to start the xtts server with audio only (no video). NOTE: on the first run xtts will download DeepSpeed from github. If DeepSpeed fails to download ("Warning: Retrying (Retry... ReadTimeoutError...)"), turn on a VPN to download DeepSpeed (27 MB) and the xtts checkpoint (1.8 GB), then you can turn it off. The xtts checkpoint can be downloaded without a VPN, but if you interrupt the download the checkpoint will be broken - you have to manually delete the \xtts_models\ dir and restart xtts.
- Double click `talk-llama-wav2lip.bat`, `talk-llama-wav2lip-ru.bat` or `talk-llama-just-audio.bat`. Don't run the exe directly, just the bat. NOTE: if you have Cyrillic (Russian) letters in a .bat, save it in Cyrillic "OEM 866" encoding (Notepad++ supports it).
- Start speaking. 

### Tweaks for 6 and 8 GB VRAM
- use the CPU instead of the GPU; it will be a bit slower (5-6 s): in talk-llama-wav2lip.bat find ngl and change it to `-ngl 0` (Mistral has 33 layers, try values from 0 to 33 to find the best speed)
- set a smaller context size for llama: `--ctx_size 512`
- set `--lowvram` in xtts_wav2lip.bat; that will move the xtts model from GPU to RAM after each xtts request (but it will be slower)
- set `--wav-chunk-sizes=9999` in xtts_wav2lip.bat; it will be a bit slower, but will make fewer wav2lip requests
- try a smaller whisper model, for example [small](https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-small-q5_1.bin) or [english distilled medium](https://huggingface.co/distil-whisper/distil-medium.en/blob/main/ggml-medium-32-2.en.bin) (a combined sketch of these changes follows this list)
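
Put together, the low-VRAM changes might look roughly like this (a sketch assuming the .bat files call the exe and `python -m xtts_api_server` directly; keep the rest of your command lines as shipped):

```
rem talk-llama-wav2lip.bat (fragment): Mistral on CPU, smaller context, smaller whisper model
talk-llama-fast.exe -ngl 0 --ctx_size 512 -mw models/ggml-small-q5_1.bin

rem xtts_wav2lip.bat (fragment): unload xtts from VRAM between requests, fewer wav2lip requests
python -m xtts_api_server --lowvram --wav-chunk-sizes=9999
```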

### Optional
- Put new xtts voices into `\xtts\speakers\`. I recommend 16-bit mono 22050 Hz wavs, about 10 seconds long, without noise or music. Use Audacity to edit them.
- Put new videos into `\SillyTavern-extras\modules\wav2lip\input\`. I recommend 300x400, 25 fps, about 1 minute long; don't use high-res videos, they use A LOT of VRAM. One video per folder. The folder name should match the desired xtts voice name and the char name in talk-llama-wav2lip.bat, e.g. Anna.wav and \Anna\youtube_ann_300x400.mp4 for a character named Anna (see the layout example after this list). With the `--multi-chars` param, talk-llama will pass the name of a new character to xtts and wav2lip even if that character is not defined in the bat or the start prompt. If xtts can't find that voice it will use the default voice; if wav2lip can't find that video it will use the default video.
- Put the character description and some example replies into assistant.txt.
- Use the exact same name for your character, the .wav file and the video folder. You can also make copies of the audio/video files (e.g. Kurt Cobain and Kurt); then you can address him both ways.
- For better Russian in XTTS check my finetune: https://huggingface.co/Ftfyhh/xttsv2_banana. It is not for streaming (it hallucinates on short replies); use it with the default xtts in SillyTavern.
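
For example, the files for a character named Anna (the example above) would be laid out like this, with the same name used for the voice, the video folder and the character in talk-llama-wav2lip.bat (e.g. `--xtts-voice Anna`):

```
\xtts\speakers\Anna.wav                                                  <- xtts voice (16-bit mono, 22050 Hz)
\SillyTavern-extras\modules\wav2lip\input\Anna\youtube_ann_300x400.mp4   <- wav2lip video (folder name = character name)
```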

#### Optional: better comma handling for xtts - only for xtts audio without wav2lip video
Better speech, but a little slower for the first sentence. Xtts won't split sentences on commas ',':
in `c:\Users\[USERNAME]\miniconda3\Lib\site-packages\stream2sentence\stream2sentence.py`, line 191, replace
```sentence_delimiters = '.?!;:,\n…)]}。'```
with
```sentence_delimiters = '.?!;:\n…)]}。'```

#### Optional: google search plugin
- download [search_server.py](https://github.com/Mozer/talk-llama-fast/blob/master/search_server.py) from my repo
- install langchain: `pip install langchain`
- sign up at https://serper.dev/api-key - it is free and fast and gives you 2500 free searches. Get an API key and paste it into search_server.py at line 13: `os.environ["SERPER_API_KEY"] = "your_key"`
- start the search server by double clicking search_server.py. Now you can use voice commands like `Please google who is Barack Obama` or `Пожалуйста погугли погоду в Москве` ("Please google the weather in Moscow"). A sketch of the server idea is shown after this list.
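
The repo's search_server.py handles this end of the pipeline; the sketch below only illustrates the idea: a tiny HTTP server on port 8003 (the default `--google-url`) that runs a Serper.dev search through langchain's `GoogleSerperAPIWrapper` and returns plain text. The `?q=` request format and the handler are assumptions for illustration, not the actual file, and on newer langchain versions the import lives in `langchain_community.utilities` instead.

```python
# Minimal sketch of the idea behind search_server.py (not the actual file from the repo).
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

from langchain.utilities import GoogleSerperAPIWrapper

os.environ["SERPER_API_KEY"] = "your_key"  # paste your serper.dev key here
search = GoogleSerperAPIWrapper()          # reads SERPER_API_KEY from the environment

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # query format is an assumption: GET /?q=who+is+Barack+Obama
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        answer = search.run(query) if query else ""
        body = answer.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8003), SearchHandler).serve_forever()
```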

## Building, optional
- for Nvidia GPUs on Windows; for other systems, try building yourself
- download https://www.libsdl.org/release/SDL2-devel-2.28.5-VC.zip and extract it to the /whisper.cpp/SDL2/ folder
- install libcurl using vcpkg:

```
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
vcpkg install curl[tool]
```

- Modify the path `c:\DATA\Soft\vcpkg\scripts\buildsystems\vcpkg.cmake` below to the folder where you installed vcpkg, then build:

```
git clone https://github.com/Mozer/talk-llama-fast
cd talk-llama-fast
set SDL2_DIR=SDL2\cmake
cmake.exe -DWHISPER_SDL2=ON -DWHISPER_CUBLAS=1 -DCMAKE_TOOLCHAIN_FILE="c:\DATA\Soft\vcpkg\scripts\buildsystems\vcpkg.cmake" -B build
cmake.exe --build build --config release --target clean
del build\bin\Release\talk-llama.exe & cmake.exe --build build --config release
```

For old CPUs without AVX2, use:

```
cmake.exe -DWHISPER_NO_AVX2=1 -DWHISPER_SDL2=ON -DWHISPER_CUBLAS=1 -DCMAKE_TOOLCHAIN_FILE="c:\DATA\Soft\vcpkg\scripts\buildsystems\vcpkg.cmake" -B build
```


## talk-llama.exe params

```
-h,       --help                 [default] show this help message and exit
-t N,     --threads N            [4]       number of threads to use during computation
-vms N,   --voice-ms N           [10000]   voice duration in milliseconds
-c ID,    --capture ID           [-1]      capture device ID
-mt N,    --max-tokens N         [32]      maximum number of tokens per audio chunk
-ac N,    --audio-ctx N          [0]       audio context size (0 - all)
-ngl N,   --n-gpu-layers N       [999]     number of layers to store in VRAM
-vth N,   --vad-thold N          [0.60]    voice avg activity detection threshold
-vths N,  --vad-start-thold N    [0.000270] vad min level to stop tts, 0: off, 0.000270: default
-vlm N,   --vad-last-ms N        [0]       vad min silence after speech, ms
-fth N,   --freq-thold N         [100.00]  high-pass frequency cutoff
-su,      --speed-up             [false]   speed up audio by x2 (reduced accuracy)
-tr,      --translate            [false]   translate from source language to english
-ps,      --print-special        [false]   print special tokens
-pe,      --print-energy         [false]   print sound energy (for debugging)
-vp,      --verbose-prompt       [false]   print prompt at start
          --verbose              [false]   print speed
-ng,      --no-gpu               [false]   disable GPU
-p NAME,  --person NAME          [Georgi]  person name (for prompt selection)
-bn NAME, --bot-name NAME        [LLaMA]   bot name (to display)
-w TEXT,  --wake-command T       []        wake-up command to listen for
-ho TEXT, --heard-ok TEXT        []        said by TTS before generating reply
-l LANG,  --language LANG        [en]      spoken language
-mw FILE, --model-whisper        [models/ggml-base.en.bin]   whisper model file
-ml FILE, --model-llama          [models/ggml-llama-7B.bin]  llama model file
-s FILE,  --speak TEXT           [./examples/talk-llama/speak] command for TTS
          --prompt-file FNAME    []        file with custom prompt to start dialog
          --session FNAME        [none]    file to cache model state in (may be large!)
-f FNAME, --file FNAME           []        text output file name
          --ctx_size N           [2048]    size of the prompt context
-b N,     --batch-size N         [64]      size of input batch
-n N,     --n_predict N          [64]      max number of tokens to predict
          --temp N               [0.90]    temperature
          --top_k N              [40.00]   top_k
          --top_p N              [1.00]    top_p
          --min_p N              [0.00]    min_p
          --repeat_penalty N     [1.10]    repeat_penalty
          --repeat_last_n N      [256]     repeat_last_n
          --xtts-voice NAME      [emma_1]  xtts voice without .wav
          --xtts-url TEXT        [http://localhost:8020/]    xtts/silero server URL, with trailing slash
          --xtts-control-path FNAME [c:\DATA\LLM\xtts\xtts_play_allowed.txt] not used anymore
          --xtts-intro           [false]   xtts instant short random intro like Hmmm.
          --sleep-before-xtts    [0]       sleep llama inference before xtts, ms
          --google-url TEXT      [http://localhost:8003/]    langchain google-serper server URL, with trailing slash
          --allow-newline        [false]   allow new line in llama output
          --multi-chars          [false]   xtts will use same wav name as in llama output
          --push-to-talk         [false]   hold Alt to speak
          --seqrep               [false]   sequence repetition penalty, search last 20 in 300
          --split-after N        [0]       split after first n tokens for tts
          --min-tokens N         [0]       min new tokens to output
          --stop-words TEXT      []        llama stop words, separated by ;
```
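
For orientation, a start command like those in the .bat files might look roughly like this. It is a hypothetical example: the exe and model file names are placeholders and only the flag names come from the list above; check the shipped .bat files for the real values.

```
rem Hypothetical example only - real values live in talk-llama-wav2lip.bat.
talk-llama-fast.exe -mw models/ggml-medium-q5_0.bin -ml models/mistral-7b-instruct.Q5_K_M.gguf ^
  -l en -p User -bn Anna --xtts-voice Anna --xtts-url http://localhost:8020/ ^
  --google-url http://localhost:8003/ --ctx_size 2048 -n 64 --temp 0.9 -ngl 999 --multi-chars
```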



## Voice commands:
The full list of commands and variations is in `talk-llama.cpp`; search for `user_command`.
- Stop (остановись, Ctrl+Space)
- Regenerate (переделай, Ctrl+Right) - regenerates llama's answer
- Delete (удали, Ctrl+Delete) - deletes the user question and llama's answer
- Delete 3 messages (удали 3 сообщения)
- Reset (удали все, Ctrl+R) - deletes all context except for the initial prompt
- Google something (погугли что-то)
- Call NAME (позови Алису)

## Known bugs
- if you have missing CUDA .dll errors, see this [issue](https://github.com/Mozer/talk-llama-fast/issues/5)
- if whisper doesn't hear your voice, see this [issue](https://github.com/Mozer/talk-llama-fast/issues/5)
- Rope context is not implemented. Use context shifting (enabled by default).
- sometimes whisper hallucinates; you need to add the hallucinations to stop-words. Check `misheard text` in `talk-llama.cpp`
- don't put Cyrillic (Russian) letters in character names or paths in .bat files; they may not work properly because of weird encoding. Copy the text from the .bat and paste it into `cmd` if you need to use Cyrillic letters with talk-llama-fast.exe.
- during the first run, wav2lip runs face detection on a newly added video. It takes about 30-60 s, but it happens just once and the result is then saved to cache. There is also a bug with face detection which slows everything down (memory leak): you need to restart SillyTavern Extras after face detection is finished.
- sometimes the wav2lip video window disappears but the audio keeps playing fine. If the video window doesn't come back automatically, restart SillyTavern Extras.
- if you restart xtts, you also need to restart SillyTavern-extras. Otherwise wav2lip will start playing wrong segments of already created videos.
- sometimes when you type fast, the first letter of your message is not printed.

## Licenses
- talk-llama-fast - MIT License - OK for commercial use
- whisper.cpp - MIT License - OK for commercial use
- whisper - MIT License - OK for commercial use
- TTS(xtts) - Mozilla Public License 2.0 - OK for commercial use
- xtts-api-server - MIT License - OK for commercial use
- Silly Extras - GNU Public License v3.0 - OK for commercial use
- Mistral 7B - Apache 2.0 license - OK for commercial use
- Wav2Lip - non-commercial - for commercial requests, please contact synclabs.so directly

## Contacts
Reddit: https://www.reddit.com/user/tensorbanana2

Telegram: https://t.me/tensorbanana

Donate: https://github.com/Mozer/donate