Mozer / talk-llama-fast

Port of OpenAI's Whisper model in C/C++ with xtts and wav2lip
MIT License

Does not work with microphone #14

Closed · sanjuhs closed this issue 5 months ago

sanjuhs commented 5 months ago

So I followed the basic installation process. I can start the Extras server as well as the XTTS server, but the final talk-llama.exe does not seem to be working: when I speak into the microphone, nothing goes through to the app. I checked my microphone in Discord and OBS and it seems to be working fine. Please advise.

Apologies again for the noob question, but even with the mic on, nothing happens in the program. Kindly advise.

Here is the terminal log for silly_extras.bat:

```
C:\Users\USER\Desktop\coding\python\realtime\talk-llama-fast-v0.1.3\SillyTavern-Extras>call conda activate extras
Using torch device: cpu
Initializing wav2lip module
wav2lip: running init generation with default and silence.wav
in wav2lip_server_generate: is busy: 0, face_detect_running: 0, chunk: 0, chunk_needed: 0, reply: 0
speech detected, wav2lip_server won't generate
Deleting old temporary wavs and mp4s.
No API key given because you are running locally.
Wav2lip videos can be played now.
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
```

Here is the start of xtts_wav2lip.bat:

```
C:\Users\USER\Desktop\coding\python\realtime\talk-llama-fast-v0.1.3\xtts>call conda activate xtts
2024-04-14 16:53:52.789 | INFO | xtts_api_server.modeldownloader:upgrade_tts_package:80 - TTS will be using 0.22.0 by Mozer
2024-04-14 16:53:52.789 | INFO | xtts_api_server.server:<module>:76 - Model: 'v2.0.2' starts to load, wait until it loads
[2024-04-14 16:54:04,084] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-14 16:54:04,414] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-04-14 16:54:04,600] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.2+unknown, git-hash=unknown, git-branch=unknown
[2024-04-14 16:54:04,601] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-04-14 16:54:04,601] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-04-14 16:54:04,602] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2024-04-14 16:54:04,775] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1, 'rope_theta': 10000}
2024-04-14 16:54:05.268 | INFO | xtts_api_server.tts_funcs:load_model:190 - Pre-create latents for all current speakers
2024-04-14 16:54:05.268 | INFO | xtts_api_server.tts_funcs:get_or_create_latents:259 - creating latents for Anna: speakers/Anna.wav
2024-04-14 16:54:08.066 | INFO | xtts_api_server.tts_funcs:get_or_create_latents:259 - creating latents for default: speakers/default.wav
2024-04-14 16:54:08.109 | INFO | xtts_api_server.tts_funcs:get_or_create_latents:259 - creating latents for Google: speakers/Google.wav
2024-04-14 16:54:08.169 | INFO | xtts_api_server.tts_funcs:get_or_create_latents:259 - creating latents for Kurt Cobain: speakers/Kurt Cobain.wav
2024-04-14 16:54:08.234 | INFO | xtts_api_server.tts_funcs:create_latents_for_all:270 - Latents created for all 4 speakers.
2024-04-14 16:54:08.235 | INFO | xtts_api_server.tts_funcs:load_model:193 - Model successfully loaded
C:\Users\USER\Miniconda3\envs\xtts\Lib\site-packages\pydantic\_internal\_fields.py:160: UserWarning: Field "model_name" has conflict with protected namespace "model_".
You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
  warnings.warn(
INFO:     Started server process [13136]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8020 (Press CTRL+C to quit)
```

And here is the one for talk-llama-wav2Lip.bat:

```
C:\Users\USER\Desktop\coding\python\realtime\talk-llama-fast-v0.1.3>talk-llama.exe -mw ggml-medium.en-q5_0.bin -ml zephyr-7b-beta.Q4_K_S.gguf -p "Alex" --speak speak --vad-last-ms 200 --vad-start-thold 0.000270 --bot-name "Anna" --prompt-file assistant.txt --temp 1.15 --ctx_size 3548 --multi-chars --allow-newline --seqrep --stop-words Aleks:;alex:;---;ALex -ngl 99 -n 60 --threads 4 --split-after 5 --sleep-before-xtts 1000
Warning: c:\DATA\LLM\xtts\xtts_play_allowed.txt file not found, xtts wont stop on user speech without it
whisper_init_from_file_with_params_no_state: loading model from 'ggml-medium.en-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 2
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load: CUDA0 total size = 793.41 MB
whisper_model_load: model size = 793.41 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size = 11.01 MB
whisper_init_state: kv cross size = 12.29 MB
whisper_init_state: compute buffer (conv) = 28.68 MB
whisper_init_state: compute buffer (encode) = 594.22 MB
whisper_init_state: compute buffer (cross) = 7.85 MB
whisper_init_state: compute buffer (decode) = 98.31 MB
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from zephyr-7b-beta.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 217 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.86 GiB (4.57 BPW)
llm_load_print_meta: general.name = huggingfaceh4_zephyr-7b-beta
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 3877.55 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 3548
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 443.50 MiB
llama_new_context_with_model: KV self size = 443.50 MiB, K (f16): 221.75 MiB, V (f16): 221.75 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 29.88 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 521.38 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.00 MiB
llama_new_context_with_model: graph splits (measure): 3
WARNING: model is not multilingual
run: processing, 4 threads, lang = en, task = transcribe, timestamps = 0 ...
init: found 2 capture devices:
init:    - Capture device #0: 'CABLE Output (VB-Audio Virtual Cable)'
init:    - Capture device #1: 'Microphone (Logi C270 HD WebCam)'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:    - sample rate: 16000
init:    - format: 33056 (required: 33056)
init:    - channels: 1 (required: 1)
init:    - samples per frame: 1024
run : initializing - please wait ...
run : done! start speaking in the microphone
Llama stop words: 'Alex:', 'Aleks:', 'alex:', '---', 'ALex',
Alex:
```
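
For context, the `init: found 2 capture devices` lines come from SDL2's audio device enumeration. Below is a minimal sketch of the equivalent SDL2 calls (an illustration, not this repo's actual source): passing `NULL` to `SDL_OpenAudioDevice` asks for the OS default capture device, so if Windows routes the default to the VB-Audio Virtual Cable rather than the physical mic, the app records silence even though the mic works fine in Discord and OBS.

```c
// Minimal sketch (illustration only, not this repo's actual code) of the
// SDL2 calls behind the "init: found 2 capture devices" log above.
#include <SDL2/SDL.h>
#include <stdio.h>

int main(void) {
    if (SDL_Init(SDL_INIT_AUDIO) != 0) {
        fprintf(stderr, "SDL_Init failed: %s\n", SDL_GetError());
        return 1;
    }

    // Enumerate capture (recording) devices, as seen in the log.
    const int n = SDL_GetNumAudioDevices(SDL_TRUE /* iscapture */);
    printf("found %d capture devices:\n", n);
    for (int i = 0; i < n; i++) {
        printf(" - Capture device #%d: '%s'\n", i, SDL_GetAudioDeviceName(i, SDL_TRUE));
    }

    SDL_AudioSpec desired = {0};
    SDL_AudioSpec obtained;
    desired.freq     = 16000;     // Whisper expects 16 kHz mono
    desired.format   = AUDIO_F32; // 0x8120 == 33056, the "format" in the log
    desired.channels = 1;
    desired.samples  = 1024;

    // Passing NULL means "the OS default capture device". If Windows points
    // the default at the VB-Audio Virtual Cable instead of the physical mic,
    // this succeeds but captures silence.
    SDL_AudioDeviceID dev = SDL_OpenAudioDevice(NULL, SDL_TRUE, &desired, &obtained, 0);
    if (dev == 0) {
        fprintf(stderr, "couldn't open capture device: %s\n", SDL_GetError());
        SDL_Quit();
        return 1;
    }
    printf("obtained: %d Hz, format %d, %d channel(s), %d samples per frame\n",
           obtained.freq, obtained.format, obtained.channels, obtained.samples);

    SDL_CloseAudioDevice(dev);
    SDL_Quit();
    return 0;
}
```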

(screenshot attached)

Mozer commented 5 months ago

Check this issue: https://github.com/Mozer/talk-llama-fast/issues/5. Try running from cmd, try Windows' 'Listen to this device' setting, and try another mic (e.g. the WO Mic Android app).
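
A note on device selection: whisper.cpp's talk examples normally expose a capture-device flag (`-c N` / `--capture N`), and your log shows the Logi webcam as capture device #1. If this build kept that flag (an assumption, verify with `talk-llama.exe --help`), forcing the mic explicitly, e.g. `talk-llama.exe -c 1 -mw ggml-medium.en-q5_0.bin ...` (remaining flags as in your log), would bypass the default-device guesswork entirely.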

Upd: I also have VB-Audio CABLE installed, but it causes no problems for me.

sanjuhs commented 5 months ago

Thanks for pointing me to issue 5. I had to uninstall VB-CABLE first via its installer, then from Device Manager, and then from the Windows Control Panel; after that I reinstalled it. This made my microphone work properly again, so I believe it was a microphone issue. Thank you. Also, talking to the assistant in real time truly feels like something else!

Will close the issue. Thank you!