CRCODE22 opened 4 weeks ago
Yeah, it's possible. I could add a dropdown list to choose the LLM. I considered it but deemed it unnecessary and overcomplicating things, since it's mainly just a tech demo. Which LLM would you choose? Ollama integration could be done through the API; I would consider it out of scope for the Gradio app.
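For reference, a minimal sketch of what such a dropdown could look like in a Gradio app. This is an assumption of how it might be wired, not the repo's actual code; the model list and the `load_chat_model` helper here are hypothetical:

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical list of chat models the user could pick from
CHAT_MODELS = [
    "Qwen/Qwen2.5-3B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
]

def load_chat_model(model_name):
    # Download/reload the chosen model from the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

with gr.Blocks() as demo:
    model_dropdown = gr.Dropdown(
        choices=CHAT_MODELS,
        value=CHAT_MODELS[0],
        label="Voice-chat LLM",
    )
    status = gr.Textbox(label="Status")

    def switch_model(model_name):
        load_chat_model(model_name)
        return f"Loaded {model_name}"

    model_dropdown.change(switch_model, inputs=model_dropdown, outputs=status)

demo.launch()
```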
I also was wondering how one can swap the voice chat LLM. For one, the current one takes 5 to 10 minutes to answer, let alone to speak, on my Mac running this in Pinokio. And secondly, it keeps blocking answers, which kind of defeats the purpose of running all of this locally if it blocks answers anyway, like the online cloud chatbots such as ChatGPT. A dropdown list would fix all of this, I think. What a great idea. In the meantime, is it possible to do this manually, to exchange that voice chat LLM for anything else out there, and if yes, how would such a thing be done? And which ones are compatible with your app?
Are you sure it's always that slow and not just the first time, when it needs to download the model? It's possible to change it in the code; it's just one line with the model name on the Hugging Face Hub. This is already a 3B-parameter model, so you can't really go any lower and expect anything decent in terms of quality; perhaps this feature is too heavy for your machine. Have you tried any other local LLMs which were faster on your machine? Which one would you like to use? It's possible to integrate an uncensored one like Dolphin, but that would be even slower for you. You can also look into an API LLM to make it faster.
@Akossimon Basically you need sufficient GPU memory; otherwise the LLM/TTS model is (probably) running on your shared memory or CPU. If you have tried the Space demo, say on Hugging Face or ModelScope, it's fast enough lol
Mac Mini M2 Pro with 32GB RAM. I installed it inside of Pinokio. The download took 20 minutes, yes, but so do the answers inside of the chatbot tab. Then getting blocked answers makes it just not interesting any more. This is why I was hoping there are other LLMs that are uncensored and smaller, so they reply faster. I am familiar with checkpoints inside Stable Diffusion for generative AI, but here it is all new vocabulary, and I would not know where to look for such chatbot LLMs; how to swap the current ones is a mystery to me right now as well. Are there links you could recommend to these Dolphin ones, or such API LLMs (never heard of either before), so I can experiment with them? And yes, I once had Open WebUI installed within Pinokio on my Mac, with Ollama inside, and that answered much faster.
@Akossimon Just to clarify: inside Batched TTS it takes a few seconds to generate audio, and in Chat it takes 20 minutes? Have you noticed if the text appears after the 20 minutes (so it's definitely the LLM)? The LLM + TTS should fit inside the 32GB unified memory your Mac has, so the problem might be somewhere else. Please answer these questions to narrow down the issue.
Mac Mini M2 Pro with 32GB RAM...
Yeah~ the point is to have sufficient GPU memory; with 32GB RAM it's just the CPU doing the inference. Once you try the online Space demo and see its speed, you'll know that. If you missed it, the demos are at https://huggingface.co/spaces/mrfakename/E2-F5-TTS or https://modelscope.cn/studios/modelscope/E2-F5-TTS
Text-to-text chat with Ollama alone will not consume much GPU memory. Or have you also succeeded with voice chat? (If so, it would be nice if you could point us to how the pipeline goes.)
To be specific, the voice chat will:
1. transcribe your recorded audio to text (speech recognition),
2. run the chat LLM (Qwen2.5-3B-Instruct by default) to generate a reply, and
3. synthesize the reply to speech with F5-TTS,
so all three models have to fit and run on your machine.
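As a rough illustration of those three stages, here is a minimal sketch assuming a Whisper model for transcription and a transformers chat pipeline; the model names are assumptions, and the actual TTS call is left as a placeholder since the Gradio app handles that step itself:

```python
# Minimal sketch of the three voice-chat stages (assumed structure, not the
# repo's actual code): speech recognition -> chat LLM -> text-to-speech.
from transformers import pipeline

# 1) Speech recognition: transcribe the recorded audio to text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

# 2) Chat LLM: generate a reply to the transcribed text.
chat = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")

def voice_chat_turn(audio_path):
    user_text = asr(audio_path)["text"]
    messages = [{"role": "user", "content": user_text}]
    reply = chat(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    # 3) Text-to-speech: hand the reply text to F5-TTS for synthesis
    #    (placeholder; done by the Gradio app in practice).
    return user_text, reply
```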
I use the mic and the record button. Seeing the transcribed text from my audio recording takes about 30 seconds and upward, then I wait for an answer, usually 5 to 20 minutes, then it converts it to audio, which takes around 2 minutes and upward. I am speculating it's the Qwen that's slow?
How much GPU memory do you have?
I have a Mac... I do not have any GPU that is being addressed, for Python cannot do this on Macs, so it all runs only on the CPU and on my 32GB RAM. At least that's all I know so far from what I learned, but I could be wrong.
Ah, I got it. Try https://github.com/lucasnewman/f5-tts-mlx, which is mentioned in the acknowledgements in the README, but I have no idea if the Gradio app could be transferred to that repo.
On that point, the Space demo seems to fit your needs better, as there appears to be no critical need for you to deploy it yourself. Hugging Face may have a limited quota if you are not a Pro user, but ModelScope is also usable; try it!
I am at a huge loss unfortunately, for I have no idea what you are recommending. That link is a GitHub project, so far I got it... but Pinokio does installations for me, and I believe Pinokio cannot install from a web address link.
I have never heard of the word Gradio, and I also do not know what "the Gradio app could be transferred to that repo" means.
ModelScope is cloud computing, and this runs on their cloud, right?
Yes, and it's free, so why not just use it?
OK, let me try... thanks so much!
Is there a way to change the model in the code? To maybe give it access to the LM Studio API? If so, in which file does one have to look? :D
Yes, in F5-TTS\src\f5_tts\infer\infer_gradio.py -> load_chat_model(). It is hardcoded as "Qwen/Qwen2.5-3B-Instruct", but you can change that to anything from https://huggingface.co/models?pipeline_tag=text-generation. Or write custom code for an API.
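As a rough illustration of the "custom code for API" route: LM Studio's local server exposes an OpenAI-compatible endpoint (by default at http://localhost:1234/v1), so the local generation call could be swapped for an HTTP call along these lines. This is a sketch, not the app's actual code; the function name and parameters are placeholders, and the port depends on your LM Studio settings:

```python
# Sketch: send the chat turn to LM Studio's OpenAI-compatible local server
# instead of running the model in-process (default port 1234; adjust as needed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def chat_via_lm_studio(history, system_prompt):
    # history: list of {"role": ..., "content": ...} dicts, as in a chat UI
    messages = [{"role": "system", "content": system_prompt}] + history
    response = client.chat.completions.create(
        model="local-model",  # LM Studio serves whatever model is currently loaded
        messages=messages,
        max_tokens=512,
        temperature=0.7,
    )
    return response.choices[0].message.content
```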
Thanks, this works. Can we use quantized GGUF models? I couldn't find what link to use in 'load_chat_model()' as there are multiple GGUF models inside a folder. Like for example here - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main
@Fhd89 Try using any of those names. Like this: model = AutoModel.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GGUF", filename="mistral-7b-instruct-v0.2.Q6_K.gguf")
I am getting this error : OSError: TheBloke/Mistral-7B-Instruct-v0.2-GGUF does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.
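That error is expected here: plain AutoModel.from_pretrained does not accept a filename argument, so it goes looking for pytorch_model.bin / model.safetensors in the repo. Recent transformers releases can load GGUF checkpoints via the gguf_file argument (the weights are dequantized on load, so memory use is roughly that of the full-precision model). A hedged sketch, assuming the gguf package is installed and the architecture is supported:

```python
# Sketch: loading a GGUF checkpoint with transformers (requires `pip install gguf`
# and a recent transformers version; the file is dequantized when loaded).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
gguf_file = "mistral-7b-instruct-v0.2.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
```

If keeping the weights quantized at runtime is the point (for example for CPU-only inference), a llama.cpp-based runtime such as llama-cpp-python or Ollama may be a better fit.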
It would be great to have an option for other quantized LLMs, or (through Ollama) for the voice chat.
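For anyone who wants to experiment with the Ollama route, the idea would be similar to the LM Studio sketch above: point the chat step at Ollama's local HTTP API (default port 11434) instead of loading a model with transformers. This follows Ollama's documented /api/chat interface; the model name and parameters below are placeholders:

```python
# Sketch: sending the chat turn to a local Ollama server instead of an
# in-process transformers model (assumes e.g. `ollama pull mistral` was run).
import requests

def chat_via_ollama(history, system_prompt, model="mistral"):
    messages = [{"role": "system", "content": system_prompt}] + history
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```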