KoljaB / LocalAIVoiceChat

Local AI talk with a custom voice based on the Zephyr 7B model. Uses RealtimeSTT with faster_whisper for transcription and RealtimeTTS with Coqui XTTS for synthesis.

Voice stuttering, Macbook Pro M1 16GB, what to change? #2

Open stevenbaert opened 7 months ago

stevenbaert commented 7 months ago

Love this project! I was playing around with it. The voice works fine, but it stutters: it starts correctly ("This is how ..."), then stops, continues ("voice x"), stops again ("sounds like"). What would you recommend changing?

Thanks for your input!

KoljaB commented 7 months ago

To my knowledge, the M1 does not yet achieve the inference speed needed for realtime synthesis with Coqui TTS using the XTTS v2 model. I don't have a Mac, so I can't really experiment with how to improve the situation.

If Metal Performance Shaders (MPS) are available on your Mac, moving the model to the GPU might help. Open coqui_engine.py and replace the device selection code with this:

    # prefer CUDA, then Apple's MPS backend, otherwise fall back to CPU
    if torch.cuda.is_available():
        logging.info("CUDA available, GPU inference used.")
        device = torch.device("cuda")
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        logging.info("MPS available, GPU inference used.")
        device = torch.device("mps")
    else:
        logging.info("CUDA and MPS not available, CPU inference used.")
        device = torch.device("cpu")
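
To check up front whether your torch build supports MPS at all, this quick one-liner should work:

    python -c "import torch; print(torch.backends.mps.is_available(), torch.backends.mps.is_built())"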

Raising stream_chunk_size from 20 to 200 may also help, since it produces bigger synthesized chunks.

You can also make coqui_engine synthesize only full sentences:

    # collect all synthesized chunks first instead of streaming them out
    chunklist = []

    for i, chunk in enumerate(chunks):
        chunk = postprocess_wave(chunk)
        chunklist.append(chunk.tobytes())

    # then send everything at once, so playback only starts
    # after the whole sentence has been synthesized
    for chunk in chunklist:
        conn.send(('success', chunk))

I know these are all not great solutions, but I think Coqui just does not fully support the M1 currently. I guess general PyTorch support for Mac will get better with future versions (I think I read they are working on this). The libraries used (faster_whisper, llama and Coqui TTS) all use torch currently; I'm not sure if switching to a completely different model inference provider like tinygrad or tensorflow is an option for the future.

Maybe Coqui will optimize their synthesis further; they have improved a lot in the past. Two months ago I was not able to get realtime speed on my environment (RTX 2080, AMD Ryzen, 32 GB DDR4). I hope it will get better soon.

KoljaB commented 7 months ago

I've just realized that Coqui also offers an excellent XTTS streaming server project. Their approach differs somewhat from mine. Therefore, if you can get their implementation working and experience similar stuttering issues, it likely indicates that the problem is beyond my control. However, if their version runs smoothly, it suggests that there might be a specific issue with my RealtimeTTS implementation for Mac, which I can then focus on resolving.

stevenbaert commented 7 months ago

Can you tell me where I can find this coqui_engine.py?

stevenbaert commented 7 months ago

Another option is to just use the default Mac voices(?) Btw, I plan to move to a MacBook M3, any idea if that is supported?

KoljaB commented 7 months ago

coqui_engine.py is in your site-packages installation folder; you can find it with "pip show realtimetts". You might need to install with "pip install -e realtimetts" to be able to edit it; this is how it works on Windows, not sure how it is on Mac. I'll update the package within the next two days with a new coqui_engine.py that has new constructor parameters together with Metal shader support for Mac.

    def __init__(self, 
                 model_name = "tts_models/multilingual/multi-dataset/xtts_v2",
                 cloning_reference_wav: str = "female.wav",
                 language = "en",
                 speed = 1.0,
                 thread_count = 6,       # <-
                 stream_chunk_size = 20, # <- these will allow for better customization for slower machines
                 full_sentences = False, # <-
                 level=logging.WARNING
                 ):

I would not put too much hope into this though; local neural TTS is still very demanding, and these patches are basically just me grasping at every straw to make the experience better.
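
For reference, here is a minimal sketch of how these parameters could be passed once the update is out (the values are just examples for a slower machine, not something I have tested on a Mac myself):

    from RealtimeTTS import TextToAudioStream, CoquiEngine

    engine = CoquiEngine(
        language="en",
        thread_count=6,          # more worker threads for synthesis
        stream_chunk_size=200,   # bigger chunks: less stuttering, more initial latency
        full_sentences=True,     # only synthesize complete sentences
    )

    stream = TextToAudioStream(engine)
    stream.feed("This is how voice number 3 sounds like.")
    stream.play()
    engine.shutdown()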

KoljaB commented 7 months ago

New version released now with way more options in the Coqui engine constructor (please upgrade with "pip install --upgrade RealtimeTTS").

stevenbaert commented 7 months ago

Upgrading RealtimeTTS broke it. When I now run it, I get this error:

    This is how voice number 3 sounds like
    /opt/homebrew/lib/python3.11/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
      warnings.warn(
    CoquiEngine: General synthesis error: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
    occured trying to synthesize text This is how voice number 3 sounds like
    Traceback: Traceback (most recent call last):
      File "/opt/homebrew/lib/python3.11/site-packages/RealtimeTTS/engines/coqui_engine.py", line 230, in _synthesize_worker
        for i, chunk in enumerate(chunks):
      File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
        response = gen.send(None)
      File "/opt/homebrew/lib/python3.11/site-packages/TTS/tts/models/xtts.py", line 678, in inference_stream
        wav_gen = self.hifigan_decoder(gpt_latents, g=speaker_embedding.to(self.device))
      File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/opt/homebrew/lib/python3.11/site-packages/TTS/tts/layers/xtts/hifigan_decoder.py", line 688, in forward
        z = torch.nn.functional.interpolate(
      File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/functional.py", line 4006, in interpolate
        return torch._C._nn.upsample_linear1d(input, output_size, align_corners, scale_factors)
    NotImplementedError: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

    Error: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
    Exception in thread Thread-4 (synthesize_worker):
    Traceback (most recent call last):
      File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
        self.run()
      File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 982, in run
        self._target(*self._args, **self._kwargs)
      File "/opt/

KoljaB commented 7 months ago

OK, I made another new version with an optional use_mps parameter for the Coqui engine, so you can at least use it somehow.

Maybe setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 together with mps also works... but I read in the torch forum that it does not work for every missing op. As far as I know they are working on Metal shaders and adding ops with every new torch release, so I guess in the near future we will get MPS-accelerated TTS for Mac.
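
Something along these lines might work as a minimal sketch (note that the fallback variable has to be set before torch gets imported, so it belongs at the very top of the script):

    import os
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # CPU fallback for missing MPS ops

    from RealtimeTTS import CoquiEngine  # imports torch internally

    engine = CoquiEngine(use_mps=True)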

stevenbaert commented 7 months ago

Thanks, updating RealtimeTTS did work, but the stuttering is still there, also with the environment variable PYTORCH_ENABLE_MPS_FALLBACK set to 1. Note that I also get these warnings now:

    John: Hi there.
    <<< Lina: Nice to meet you too. What brings you to this bar tonight? Asking for a friend who's also curious about the poker player sitting next to us. 😉
    WARNING:root:Error fixing sentence end punctuation: string index out of range, Text: ""
    RealTimeSTT: root - WARNING - Error fixing sentence end punctuation: string index out of range, Text: ""

KoljaB commented 7 months ago

Thanks for reporting. RealtimeTTS 0.2.6 should fix the sentence end bug. Was caused by the emoji at the end.

There is not much more I can do against the stuttering on Mac right now. It will stop when setting full_sentences to True, with the downside of noticeably greater latency before the first chunk of every sentence. Playing with thread_count and stream_chunk_size might improve it, but I doubt it will resolve it. We need full PyTorch GPU support for Mac; the dependent libraries (faster_whisper, llama and Coqui TTS) all use torch currently, so switching to tinygrad, tensorflow etc. is currently not an option.

I guess we have to wait until either Coqui squeezes out better synthesis performance (they did so a lot in the past; some weeks ago I wasn't able to get realtime speed on my 2080), or until torch implements the aten::upsample_linear1d.out op in Metal shaders, at which point GPU support for Mac should be sufficient for a realtime factor < 1 (i.e. synthesizing audio faster than it plays back).

stevenbaert commented 7 months ago

Thanks for your feedback! I would like to understand better what you are doing here. Is there a comprehensive overview that's not too technical? I'm especially interested in the speech-to-text engine and how well it handles input (also in languages like Dutch or French?). Then for the output, ideally via this great TTS, but for now why not use the default Mac voices (they are OK too)?

KoljaB commented 7 months ago

I combine other people's work by somehow gluing their libraries together 😉

For STT it's basically combining a fast transcription library (faster_whisper) with two good voice activity detection libraries (Silero VAD and WebRTC VAD). That makes it possible to detect spoken sentences quite reliably, and with that you can handle large amounts of streamed audio.
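
Just to illustrate the general idea (this is not the actual RealtimeSTT code, only a minimal sketch of VAD-gated audio handling with webrtcvad):

    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

    # webrtcvad expects 10, 20 or 30 ms frames of 16-bit mono PCM
    def is_speech(frame: bytes, sample_rate: int = 16000) -> bool:
        return vad.is_speech(frame, sample_rate)

Frames get buffered while is_speech() returns True; once silence is detected, the buffered audio is handed to faster_whisper for transcription.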

For TTS it's mostly preparing the input text to get sentence fragments that the engines can synthesize well (reading the LLM output stream until such a fragment is found, or splitting longer input texts into those fragments).

The STT should work with Dutch. Set the language to "nl" (I guess?) and maybe switch to a larger model if the word error rate is too high.
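
A minimal sketch (parameter names as in RealtimeSTT; "medium" or "large-v2" are the usual faster_whisper model sizes that lower the word error rate at the cost of speed):

    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(language="nl", model="medium")
    print(recorder.text())  # blocks until a spoken sentence was detected and transcribed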

Regarding the default Mac voices: I use pyttsx3 for the SystemEngine, which is supposed to offer the native voices. I heard from another user, though, that SystemEngine also caused issues on Mac. I do not know yet whether they appear on every Mac or whether they can be solved.

stevenbaert commented 6 months ago

An update: I have my MacBook Pro M3 36GB now. I tried the default install; the stuttering is less but still there. Any suggestions to improve it?

stevenbaert commented 4 months ago

Hi again, I did a reinstall using git pull (on a MacBook Pro M3).

It seems to go better, but not yet fully without stuttering. Note that I haven't installed CUDA, since the link you give in the install instructions points to Windows and Linux only. This does impact performance, right?

Your input would be highly appreciated (any input that could improve the setup on a MacBook Pro M3), since I really love this fully local open source voice chatting. Once I fully get it working without stuttering, I would love to add tools like web browsing.

    Cuda not available
    llama_cpp_lib: return llama_cpp
    Initializing LLM llama.cpp model ...
    llama.cpp model initialized
    Initializing TTS CoquiEngine ...
    Using model: xtts
    Initializing STT AudioToTextRecorder ...
    objc[56742]: Class AVFFrameReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.59.7.100.dylib (0x2ae4b0778) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x280e60370). One of the two will be used. Which one is undefined.
    objc[56742]: Class AVFAudioReceiver is implemented in both /opt/homebrew/lib/python3.11/site-packages/av/.dylibs/libavdevice.59.7.100.dylib (0x2ae4b07c8) and /opt/homebrew/Cellar/ffmpeg/6.1.1_3/lib/libavdevice.60.3.100.dylib (0x280e603c0). One of the two will be used. Which one is undefined.
    [2024-02-10 12:12:44.419] [ctranslate2] [thread 1113247] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

stevenbaert commented 4 months ago

Note that my GPU is not used at all while the conversation is running:

[Screenshot 2024-02-10 at 12 21 36: GPU usage stays idle while the conversation runs]

KoljaB commented 4 months ago

The CoquiEngine also supports a parameter named use_deepspeed. If you can get deepspeed installed on a Mac, this is supposed to accelerate the synthesis (although I do not know if it works on CPU too). Otherwise I think only the thread_count parameter has the potential to speed things up.

I'd love to help more but I'm quite lost here too.

Maybe the community has some ideas about how to get Coqui TTS faster on CPU only? Or even better any way to get Coqui TTS working on GPU for Mac?

scalar27 commented 3 months ago

I read that deepspeed only works with CUDA (Nvidia), thus not on a Mac.