KoljaB / LocalAIVoiceChat

Local AI talk with a custom voice based on Zephyr 7B model. Uses RealtimeSTT with faster_whisper for transcription and RealtimeTTS with Coqui XTTS for synthesis.

Error running on Mac M2 #15

Open vitorcalvi opened 3 months ago

vitorcalvi commented 3 months ago

First of all, awesome repo. I've tried every possible installation combination, and all of them failed. Any suggestions? @KoljaB

Machine: Mac M2

Terminal output:

Using model: xtts
Initializing STT AudioToTextRecorder ...
[2024-06-05 15:39:29.914] [ctranslate2] [thread 1054526] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

Select voice (1-5): 1
This is how voice number 1 sounds like
/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
General synthesis error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:

Error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:

Exception in thread Thread-4 (synthesize_worker):
Traceback (most recent call last):
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/RealtimeTTS/text_to_stream.py", line 201, in synthesize_worker
    self.engine.synthesize(sentence)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/site-packages/RealtimeTTS/engines/coqui_engine.py", line 411, in synthesize
    status, result = self.parent_synthesize_pipe.recv()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Accept voice (y/n):

GPT4o output:

It appears that there are several warnings and errors related to the process of initializing the STT (Speech-to-Text) AudioToTextRecorder and selecting the voice. Here are the issues and their potential resolutions:

Compute Type Warning:

[warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

Resolution: This is a warning indicating that the model initially designed to use float16 precision has been converted to float32 because the device or backend doesn't support float16 efficiently. This is usually not a critical issue, but if you want to optimize performance, consider using hardware that supports float16 or adjust the model configuration to use float32 from the start.

Pretrained Model Configuration Warning:

UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

Resolution: Update your code to use a generation configuration file as suggested in the warning. This will ensure compatibility with future versions of the library.

General Synthesis Error:

General synthesis error: isin() received an invalid combination of arguments - got (test_elements=int, elements=Tensor, ), but expected one of:
 * (Tensor elements, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Number element, Tensor test_elements, *, bool assume_unique, bool invert, Tensor out)
 * (Tensor elements, Number test_element, *, bool assume_unique, bool invert, Tensor out)
occured trying to synthesize text This is how voice number 1 sounds like

Resolution: This error indicates a type mismatch in the function call to isin(). Make sure that the arguments passed to isin() are of the correct type as specified in the error message. The elements should either be both Tensors or one should be a Tensor and the other a Number.

To proceed, you may need to:
- Verify and update the model and its configuration to ensure compatibility with the current hardware and software environment.
- Make sure that all function calls, particularly those involving Tensors, are using the correct types as expected by the functions.

If you need further assistance or specific code examples to resolve these issues, please provide more details about your setup and the code you're running.

KoljaB commented 3 months ago

This is caused by a newer release of the transformers library that introduced an incompatibility with Coqui TTS (see here). Please either downgrade to an older transformers version:

pip install transformers==4.38.2

or upgrade RealtimeTTS to the latest version:

pip install realtimetts==0.4.1
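Before reinstalling anything, it can help to see which transformers version is actually installed in the active environment. The following is a minimal sketch using only the Python standard library; the helper names (parse_version, is_newer_than_pin, check_transformers) are hypothetical and not part of any of the libraries discussed here. The thread does not say at which exact release the incompatibility begins, so the check only reports whether the installed version is newer than the 4.38.2 pin suggested above, which is an assumption about where to draw the line.

```python
from importlib import metadata

# Version suggested in this thread; the first incompatible release is not
# stated here, so "newer than the pin" is only a hint, not a diagnosis.
PINNED = "4.38.2"


def parse_version(text):
    """Turn '4.38.2' into (4, 38, 2), ignoring suffixes such as '.dev0'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def is_newer_than_pin(installed, pinned=PINNED):
    """Return True if the installed version is strictly newer than the pin."""
    return parse_version(installed) > parse_version(pinned)


def check_transformers():
    """Describe the installed transformers version relative to the pin."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed in this environment"
    if is_newer_than_pin(installed):
        return (
            f"transformers {installed} is newer than {PINNED}; "
            f"consider: pip install transformers=={PINNED}"
        )
    return f"transformers {installed} is at or below {PINNED}"


if __name__ == "__main__":
    print(check_transformers())
```

Run it with the same interpreter that launches the voice chat (for example, inside the activated conda environment), since each environment has its own installed packages.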

vitorcalvi commented 3 months ago

Thanks for the answer :) I tested both solutions; only the older transformers version works.

Another two issues:

-- Using model: xtts
Initializing STT AudioToTextRecorder ...
[2024-06-05 16:20:24.534] [ctranslate2] [thread 27773] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.

-- Select voice (1-5): 1
This is how voice number 1 sounds like
/opt/homebrew/anaconda3/envs/localAIVoiceCHat/lib/python3.9/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(


KoljaB commented 3 months ago

Thank you for the feedback. Both warnings are completely normal and should not cause any issues.

vitorcalvi commented 3 months ago

@KoljaB thank you. I forgot another issue: speech cuts out every 1.5 to 2 seconds. Any suggestions?

KoljaB commented 3 months ago

You may want to create CoquiEngine with full_sentences=True in the constructor on Mac M2 btw, because most Macs aren't fast enough for realtime synthesis with Coqui TTS (no GPU use possible).

coqui_engine = CoquiEngine(cloning_reference_wav="female.wav", language="en", speed=1.0, full_sentences=True)

vitorcalvi commented 3 months ago

Works like a charm, but as you said, Macs aren't fast enough for realtime synthesis with Coqui TTS, and the machine gets heavily loaded. The Mac has the MLX framework, and there's another TTS library, MeloTTS, mentioned in the repo below: https://github.com/huwprosser/jarvis-mlx

Thanks bro, see you.