Vali-98 / ChatterUI

Simple frontend for LLMs built in react-native.
GNU Affero General Public License v3.0

[android] [Feature Request] Voice mode #23

Open · nntb opened this issue 4 months ago

nntb commented 4 months ago

I would like to see a voice mode where it listens and then responds, with a nice animation on the screen similar to ChatGPT's chat mode. Maybe a setting to swap the animation for an animation of the character's avatar, and maybe have it pulse when the TTS talks. Maybe integrate Kaldi TTS into it to allow ONNX voices.
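
A very rough sketch of the "pulse when the TTS talks" idea using plain react-native Animated; the component and props here are made up for illustration and are not anything that exists in ChatterUI:

```tsx
// Hypothetical example only: a speaking avatar that pulses while TTS plays.
import React, { useEffect, useRef } from 'react'
import { Animated } from 'react-native'

export function PulsingAvatar({ uri, speaking }: { uri: string; speaking: boolean }) {
  const scale = useRef(new Animated.Value(1)).current

  useEffect(() => {
    if (!speaking) {
      scale.setValue(1)
      return
    }
    // Gently grow and shrink in a loop while the TTS engine is talking.
    const pulse = Animated.loop(
      Animated.sequence([
        Animated.timing(scale, { toValue: 1.15, duration: 400, useNativeDriver: true }),
        Animated.timing(scale, { toValue: 1.0, duration: 400, useNativeDriver: true }),
      ]),
    )
    pulse.start()
    return () => pulse.stop()
  }, [speaking, scale])

  return (
    <Animated.Image
      source={{ uri }}
      style={{ width: 96, height: 96, borderRadius: 48, transform: [{ scale }] }}
    />
  )
}
```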

Vali-98 commented 4 months ago

This kind of integration is extremely difficult and will likely not be added anytime soon, but I will leave this issue up as a possible feature note. Thanks for using the app and for the suggestion.

nntb commented 4 months ago

Sure thing, just an idea. Take it or leave it. It's something I would personally like to have.

Katehuuh commented 1 month ago

> This kind of integration is extremely difficult and will likely not be added anytime soon, but I will leave this issue up as a possible feature note. Thanks for using the app and for the suggestion.

Voice mode/hands-free would be a great feature. It could possibly use whisper.rn (the React Native binding listed under ggerganov/whisper.cpp), recording audio chunks until no speech is detected.
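
A minimal sketch of that flow, assuming whisper.rn's `initWhisper`/`transcribeRealtime` API (option and event field names are from memory of its README and may differ; treat this as pseudocode rather than a drop-in implementation):

```ts
import { initWhisper } from 'whisper.rn'

// Assumed API: initWhisper loads a ggml Whisper model, transcribeRealtime
// records from the microphone and streams transcription events.
async function startVoiceCapture(
  modelPath: string,
  onUtterance: (text: string) => void,
) {
  const whisperContext = await initWhisper({ filePath: modelPath })

  const { stop, subscribe } = await whisperContext.transcribeRealtime({
    language: 'en',
    realtimeAudioSec: 60,       // hard cap on a single capture session
    realtimeAudioSliceSec: 20,  // process the recording chunk by chunk
  })

  let lastText = ''
  subscribe((event) => {
    if (event.data?.result) lastText = event.data.result

    // "Until no speech is detected": once capture stops, hand the final
    // transcript to the chat pipeline (LLM reply, then TTS, would follow).
    if (!event.isCapturing && lastText.trim().length > 0) {
      onUtterance(lastText.trim())
      lastText = ''
    }
  })

  return stop // caller can end listening manually
}
```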


Would it be worthwhile to save RAM by using a Q4/Q8 KV cache in llama.rn?

Vali-98 commented 1 month ago

> > This kind of integration is extremely difficult and will likely not be added anytime soon, but I will leave this issue up as a possible feature note. Thanks for using the app and for the suggestion.

> Voice mode/hands-free would be a great feature. It could possibly use whisper.rn (the React Native binding listed under ggerganov/whisper.cpp), recording audio chunks until no speech is detected.

This was a consideration, yes. The issue with many of these features is more about interfacing them smoothly.

> Would it be worthwhile to save RAM by using a Q4/Q8 KV cache in llama.rn?

Q4/Q8 caching requires Flash Attention, which in turn needs CUDA, which isn't available on Android.

Katehuuh commented 1 month ago

> This was a consideration, yes. The issue with many of these features is more about interfacing them smoothly.

Agreed, but as a priority: users here likely run models locally rather than through a provider, and chatting through a provider is already possible in any API chat playground.

> Q4/Q8 caching requires Flash Attention, which in turn needs CUDA, which isn't available on Android.

Layla does have Flash Attention but no CUDA, so "zero benefits"?

[r/LocalLLaMA Report: "Flash Attention feature yielded zero benefits"](https://www.reddit.com/r/LocalLLaMA/comments/1bsnifx/comment/l25mfui/)

![layla-Flash-Attention_Screenshot](https://github.com/user-attachments/assets/887806f8-19d2-48a7-97bf-aa6629b7bfbc)

---

Except for longer waits with long-context `prompt processing`, [ChatterUI](https://github.com/Vali-98/ChatterUI) on a `Snapdragon 8 Gen 3` has been the fastest in `generation` compared to [maid](https://github.com/Mobile-Artificial-Intelligence/maid) and [aub.ai](https://github.com/BrutalCoding/aub.ai), both backed by _llama.cpp_, and even compared to GPU (Adreno) support in [ExecuTorch Alpha](https://pytorch.org/executorch/main/llm/llama-demo-android.html) and [mlc-llm](https://github.com/mlc-ai/mlc-llm).

Vali-98 commented 1 month ago

> Agreed, but as a priority: users here likely run models locally rather than through a provider, and chatting through a provider is already possible in any API chat playground.

Technically, a voice mode would be compatible with both local and API providers if implemented right. It really is all down to figuring out how to integrate it into the UI.
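
To illustrate the point (all names below are hypothetical, not ChatterUI's actual code): the voice layer only needs text in / text out, so it can sit in front of either a local llama.rn context or a remote API client.

```ts
// Hypothetical sketch: the voice loop never cares where completions come from.
interface ChatBackend {
  complete(prompt: string): Promise<string>
}

async function voiceTurn(
  backend: ChatBackend,                    // local llama.rn or a remote API client
  listen: () => Promise<string>,           // e.g. a whisper.rn capture as sketched above
  speak: (text: string) => Promise<void>,  // any TTS engine
): Promise<void> {
  const userText = await listen()
  const reply = await backend.complete(userText)
  await speak(reply) // a pulsing-avatar animation could run while this plays
}
```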

> Layla does have Flash Attention but no CUDA, so "zero benefits"?

I did look into the Flash Attention implementation. Last I checked, there is a CPU implementation of FA, but it exists only for comparison testing against the CUDA implementation and provides no actual performance benefit.

> Except for longer waits with long-context prompt processing, ChatterUI on a Snapdragon 8 Gen 3 has been the fastest in generation compared to maid and aub.ai, both backed by llama.cpp, and even compared to GPU (Adreno) support in ExecuTorch Alpha and mlc-llm.

Is this using Q4_0_4_8 models? That specific quantization format improves prompt processing somewhat. Technically, all llama.cpp-backed apps should perform identically, so there is probably just some misconfiguration somewhere on their part.

Katehuuh commented 1 month ago

> Except for longer waits with long-context prompt processing

> Is this using Q4_0_4_8 models?

No intensive testing, but I've used a normal 7B-Q4_K_M.gguf: first load and processing of a 1k+ token, long-context character card, compared against maid.

Vali-98 commented 1 month ago

> No intensive testing, but I've used a normal 7B-Q4_K_M.gguf

I'd suggest you also test Q4_0_4_8 due to its speed benefits on Snapdragon 8 devices. It's almost a 50% improvement to prompt processing.