CRCODE22 opened 4 weeks ago
Yeah, it's possible. I could add a dropdown list to choose the LLM. I considered it but deemed it unnecessary and overcomplicating things, since it's mainly just a tech demo. Which LLM would you choose? Ollama integration could be done through the API; I would consider it out of scope for the Gradio app.
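For reference, a minimal sketch of what such a dropdown could look like in a Gradio app. This is an assumption of how it might be wired, not the repo's actual code; the model list and the `load_chat_model` helper here are hypothetical:

```python
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical list of chat models the user could pick from
CHAT_MODELS = [
    "Qwen/Qwen2.5-3B-Instruct",
    "meta-llama/Llama-3.2-3B-Instruct",
]

def load_chat_model(model_name):
    # Download/reload the chosen model from the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

with gr.Blocks() as demo:
    model_dropdown = gr.Dropdown(
        choices=CHAT_MODELS,
        value=CHAT_MODELS[0],
        label="Voice-chat LLM",
    )
    status = gr.Textbox(label="Status")

    def switch_model(model_name):
        load_chat_model(model_name)
        return f"Loaded {model_name}"

    model_dropdown.change(switch_model, inputs=model_dropdown, outputs=status)

demo.launch()
```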
I also was wondering how one can swap the voice chat LLM. For one, the current one takes 5 to 10 minutes to answer, let alone to speak, on my Mac running this in Pinokio. And secondly, it keeps blocking answers, which kind of defeats the purpose of running all of this locally if it blocks answers anyway, like the online cloud chatbots such as ChatGPT. A dropdown list would fix all of this, I think. What a great idea. In the meantime, is it possible to do this manually, to exchange that voice chat LLM for anything else out there, and if yes, how would such a thing be done? And which ones are compatible with your app?
Are you sure it's always that slow and not just the first time, when it needs to download the model? It's possible to change it in the code; it's just one line with the model name on the Hugging Face Hub. This is already a 3B-parameter model, so you can't really go any lower and expect anything decent in terms of quality; perhaps this feature is too heavy for your machine. Have you tried any other local LLMs which were faster on your machine? Which one would you like to use? It's possible to integrate an uncensored one like Dolphin, but that would be even slower for you. You can also look into an API LLM to make it faster.
@Akossimon Basically you need sufficient GPU memory; otherwise the LLM/TTS model is (probably) running on your shared memory or CPU. If you have tried the Space demo, say on Hugging Face or ModelScope, it's fast enough lol
Mac Mini M2 Pro with 32GB RAM. I installed it inside of Pinokio. The download took 20 minutes, yes, but so do the answers inside of the chatbot tab. Then getting blocked answers makes it just not interesting any more. This is why I was hoping there are other LLMs that are uncensored and smaller, so they reply faster. I am familiar with checkpoints inside Stable Diffusion for generative AI, but here it is all new vocabulary, and I would not know where to look for such chatbot LLMs; how to swap the current ones is a mystery to me right now as well. Are there links you could recommend to these Dolphin ones, or such API LLMs (never heard of either before), so I can experiment with them? And yes, I once had Open WebUI installed within Pinokio on my Mac, with Ollama inside, and that answered much faster.
@Akossimon Just to clarify: inside Batched TTS it takes a few seconds to generate audio, and in Chat it takes 20 minutes? Have you noticed if the text appears after the 20 minutes (so it's definitely the LLM)? The LLM + TTS should fit inside the 32GB unified memory your Mac has, so the problem might be somewhere else. Please answer these questions to narrow down the issue.
Mac Mini M2 Pro with 32GB RAM...
Yeah~ the point is to have sufficient GPU memory; with 32GB RAM it's just the CPU doing the inference. Once you try the online Space demo and see its speed, you'll know that. If you missed it, the demos are at https://huggingface.co/spaces/mrfakename/E2-F5-TTS or https://modelscope.cn/studios/modelscope/E2-F5-TTS
Text-to-text chat with Ollama alone will not consume much GPU memory. Or have you also succeeded with voice chat? (If so, it would be nice if you could point us to how the pipeline goes.)
To be specific, the voice chat will:
1. transcribe your recorded audio to text (speech recognition),
2. run the chat LLM (Qwen2.5-3B-Instruct by default) to generate a reply, and
3. synthesize the reply to speech with F5-TTS,
so all three models have to fit and run on your machine.
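As a rough illustration of those three stages, here is a minimal sketch assuming a Whisper model for transcription and a transformers chat pipeline; the model names are assumptions, and the actual TTS call is left as a placeholder since the Gradio app handles that step itself:

```python
# Minimal sketch of the three voice-chat stages (assumed structure, not the
# repo's actual code): speech recognition -> chat LLM -> text-to-speech.
from transformers import pipeline

# 1) Speech recognition: transcribe the recorded audio to text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

# 2) Chat LLM: generate a reply to the transcribed text.
chat = pipeline("text-generation", model="Qwen/Qwen2.5-3B-Instruct")

def voice_chat_turn(audio_path):
    user_text = asr(audio_path)["text"]
    messages = [{"role": "user", "content": user_text}]
    reply = chat(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
    # 3) Text-to-speech: hand the reply text to F5-TTS for synthesis
    #    (placeholder; done by the Gradio app in practice).
    return user_text, reply
```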
I use the mic and the record button. Seeing the transcribed text from my audio recording takes about 30 seconds and upward, then I wait for an answer, usually 5 to 20 minutes, then it converts it to audio, which takes around 2 minutes and upward. I am speculating it's the Qwen that's slow?
How much GPU memory do you have?
I have a Mac... I do not have any GPU that is being addressed, for Python cannot do this on Macs, so it all runs only on the CPU and on my 32GB RAM. At least that's all I know so far from what I learned, but I could be wrong.
Ah, I got it. Try https://github.com/lucasnewman/f5-tts-mlx, which is mentioned in the acknowledgements in the README, but I have no idea if the Gradio app could be transferred to that repo.
On that point, the Space demo seems to fit your needs better, as there appears to be no critical need for you to deploy it yourself. Hugging Face may have a limited quota if you are not a Pro user, but ModelScope is also usable; try it!
I am at a huge loss unfortunately, for I have no idea what you are recommending. That link is a GitHub project, so far I got it... but Pinokio does installations for me, and I believe Pinokio cannot install from a web address link.
I have never heard of the word Gradio, and I also do not know what "the Gradio app could be transferred to that repo" means.
ModelScope is cloud computing, and this runs on their cloud, right?
Yes, and it's free, so why not just use it?
OK, let me try... thanks so much!
Is there a way to change the model in the code? To maybe give it access to the LM Studio API? If so, in which file does one have to look? :D
Yes, in F5-TTS\src\f5_tts\infer\infer_gradio.py -> load_chat_model(). It is hardcoded as "Qwen/Qwen2.5-3B-Instruct", but you can change that to anything from https://huggingface.co/models?pipeline_tag=text-generation. Or write custom code for an API.
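As a rough illustration of the "custom code for API" route: LM Studio's local server exposes an OpenAI-compatible endpoint (by default at http://localhost:1234/v1), so the local generation call could be swapped for an HTTP call along these lines. This is a sketch, not the app's actual code; the function name and parameters are placeholders, and the port depends on your LM Studio settings:

```python
# Sketch: send the chat turn to LM Studio's OpenAI-compatible local server
# instead of running the model in-process (default port 1234; adjust as needed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def chat_via_lm_studio(history, system_prompt):
    # history: list of {"role": ..., "content": ...} dicts, as in a chat UI
    messages = [{"role": "system", "content": system_prompt}] + history
    response = client.chat.completions.create(
        model="local-model",  # LM Studio serves whatever model is currently loaded
        messages=messages,
        max_tokens=512,
        temperature=0.7,
    )
    return response.choices[0].message.content
```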
Thanks, this works. Can we use quantized GGUF models? I couldn't find what link to use in 'load_chat_model()' as there are multiple GGUF models inside a folder. Like for example here - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main
@Fhd89 Try using any of those names. Like this: model = AutoModel.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GGUF", filename="mistral-7b-instruct-v0.2.Q6_K.gguf")
I am getting this error : OSError: TheBloke/Mistral-7B-Instruct-v0.2-GGUF does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.
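That error is expected here: plain AutoModel.from_pretrained does not accept a filename argument, so it goes looking for pytorch_model.bin / model.safetensors in the repo. Recent transformers releases can load GGUF checkpoints via the gguf_file argument (the weights are dequantized on load, so memory use is roughly that of the full-precision model). A hedged sketch, assuming the gguf package is installed and the architecture is supported:

```python
# Sketch: loading a GGUF checkpoint with transformers (requires `pip install gguf`
# and a recent transformers version; the file is dequantized when loaded).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
gguf_file = "mistral-7b-instruct-v0.2.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
```

If keeping the weights quantized at runtime is the point (for example for CPU-only inference), a llama.cpp-based runtime such as llama-cpp-python or Ollama may be a better fit.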
It would be great to have an option for other quantized LLMs, or (through Ollama) for the voice chat.
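For anyone who wants to experiment with the Ollama route, the idea would be similar to the LM Studio sketch above: point the chat step at Ollama's local HTTP API (default port 11434) instead of loading a model with transformers. This follows Ollama's documented /api/chat interface; the model name and parameters below are placeholders:

```python
# Sketch: sending the chat turn to a local Ollama server instead of an
# in-process transformers model (assumes e.g. `ollama pull mistral` was run).
import requests

def chat_via_ollama(history, system_prompt, model="mistral"):
    messages = [{"role": "system", "content": system_prompt}] + history
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```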