Open nntb opened 4 months ago
This kind of integration is extremely difficult and will likely not be added anytime soon, but I will leave this issue up as a possible feature note. Thanks for using the app and for the suggestion.
Sure thing, just an idea. Take it or leave it. It's something I would personally like to have.
Voice mode/hands-free would be a great feature. You could possibly use whisper.rn (based on ggerganov/whisper.cpp) and record audio in chunks until no speech is detected.
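The "record until no speech is detected" part could be a simple energy-based silence detector. This is only a sketch of that idea, not the whisper.rn API; the threshold and frame counts are assumptions that would need tuning per device/microphone:

```typescript
// RMS energy of one fixed-size PCM frame of float samples in [-1, 1].
function rms(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

// Reports end-of-speech after `silentFramesNeeded` consecutive
// low-energy frames (hypothetical helper, not part of whisper.rn).
class SilenceDetector {
  private silentRun = 0;
  constructor(
    private rmsThreshold = 0.01,     // assumed value; tune per mic
    private silentFramesNeeded = 25, // ~0.5 s at 20 ms frames
  ) {}

  // Push one frame; returns true once enough trailing silence is seen.
  push(frame: Float32Array): boolean {
    this.silentRun = rms(frame) < this.rmsThreshold ? this.silentRun + 1 : 0;
    return this.silentRun >= this.silentFramesNeeded;
  }
}
```

Once `push` returns true, the app would stop recording and hand the buffered audio chunk to the transcriber.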
Would it also be good to save RAM by using a Q4/Q8 KV cache in llama.rn?
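For a sense of what's at stake, here is a back-of-envelope KV-cache size calculation for an assumed Llama-2-7B-like shape (32 layers, 32 KV heads, head dim 128, 4096-token context). The per-element sizes come from llama.cpp's block formats: q8_0 stores 32 elements in 34 bytes, q4_0 in 18 bytes:

```typescript
// Assumed model shape (Llama-2-7B-like); real models vary.
const layers = 32, kvHeads = 32, headDim = 128, ctx = 4096;
const elems = 2 * layers * ctx * kvHeads * headDim; // x2 for K and V

// Bytes per element: f16 = 2; llama.cpp block quants pack 32 elements
// per block: q8_0 = 34 bytes/block, q4_0 = 18 bytes/block.
const bytesPerElem = { f16: 2, q8_0: 34 / 32, q4_0: 18 / 32 };

function cacheMiB(fmt: keyof typeof bytesPerElem): number {
  return (elems * bytesPerElem[fmt]) / (1024 * 1024);
}

console.log(cacheMiB("f16"));  // 2048 MiB
console.log(cacheMiB("q8_0")); // 1088 MiB
console.log(cacheMiB("q4_0")); // 576 MiB
```

So on this assumed shape, a Q8 cache roughly halves KV-cache RAM and Q4 cuts it to about a quarter, which is significant on phones.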
This was a consideration, yes. The issue with many of these features is more about interfacing them smoothly.
Q4/Q8 KV caching requires Flash Attention, which in turn needs CUDA, which isn't available on Android.
Agreed, but as a priority: users here likely run models locally rather than through a provider, which is already possible in any API chat playground.
Technically, a voice mode would be compatible with both local and API providers if implemented right. It really is all down to figuring out how to integrate it into the UI.
Layla does have Flash Attention but no CUDA, so "zero benefits"?
I did read into the Flash Attention implementation. Last I checked, there is a CPU implementation of FA, but it exists only for comparison testing against the CUDA implementation, with no actual performance benefit.
Except for longer waits during long-context prompt processing, ChatterUI on a Snapdragon 8 Gen 3 has been the fastest at generation compared to Maid and aub.ai (both backed by llama.cpp), and even compared to mlc-llm and ExecuTorch Alpha with GPU (Adreno) support.
Is this using Q4_0_4_8 models? That specific quantization format improves prompt processing somewhat. Technically, all llama.cpp-backed apps should perform identically, so there is probably just some misconfiguration on their part.
> Is this using Q4_0_4_8 models?

No intensive testing, but I've used a normal 7B Q4_K_M GGUF.

> Except for longer waits with long-context prompt processing

First load and processing of a 1k+ token, long-context character card, compared to Maid.
I'd suggest you also test Q4_0_4_8 due to its speed benefits on Snapdragon 8 devices. It's almost a 50% improvement in prompt processing.
I would like to see a voice mode where it listens, then responds with a nice animation on the screen, similar to ChatGPT's chat mode. Maybe a setting to swap the animation for an animation of the character's avatar, and maybe have it pulse when the TTS talks. Maybe integrate Kaldi TTS into it to allow ONNX voices.
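The pipeline being asked for is essentially listen → transcribe → generate → speak, with the UI pulsing during speech. A rough sketch of that loop, where every function is a hypothetical injected dependency (none of these are real whisper.rn/llama.rn APIs):

```typescript
// Hypothetical interfaces for illustration only; the real app would wire
// these to its own audio capture, STT, LLM, and TTS layers.
type VoiceIO = {
  listen: () => Promise<Float32Array>;               // record until silence
  transcribe: (audio: Float32Array) => Promise<string>; // e.g. Whisper STT
  generate: (prompt: string) => Promise<string>;        // LLM reply
  // TTS playback; calls onPulse so the UI can animate the avatar.
  speak: (text: string, onPulse: () => void) => Promise<string>;
};

// One full hands-free turn; returns the assistant's reply text.
async function voiceTurn(io: VoiceIO, onPulse: () => void): Promise<string> {
  const audio = await io.listen();
  const userText = await io.transcribe(audio);
  const reply = await io.generate(userText);
  await io.speak(reply, onPulse); // avatar pulses while TTS plays
  return reply;
}
```

Keeping the stages behind an interface like this is what would let voice mode work with both local models and API providers, as discussed above.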