Closed · daniandtheweb closed this 3 months ago
This would be an interesting addition, but CPU backends or integrated-graphics backends might have issues. IIRC, Vulkan either doesn't support integrated graphics or has poor support for it, so an option to choose the backend should be available.
I would love to implement some form of acceleration, because llama.cpp on Android is painfully slow.
Older versions of maid produced output fairly quickly, depending on the model size, even on my phone, which is fairly old. Did something change in llama.cpp?
I don't know. Obviously, the llama.cpp integration in maid has changed a lot over the months I've been developing it. I was able to get the current implementation to run quite fast on mobile for a few of the small-parameter models (tinyllama 2b, calypso 3b, orcamini 3b), but more development is needed to get isolates working correctly so the app doesn't lag for ages when you prompt.
OK, as of 98533eaaa94e70605973faabc9428753bde88146 I've enabled Vulkan acceleration on Linux and Android (Windows is coming too). I've tested it out on Android; it seems noticeably faster, but not jaw-droppingly so.
I've just tested it, and with tinyllama it's way faster and totally usable now. However, when I check GPU and CPU usage, GPU usage stays below 15%, so is it really using Vulkan?
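One way to sanity-check whether the Vulkan backend is actually being picked up (a hedged sketch; the binary name and model filename are placeholders, and `-ngl` is llama.cpp's layer-offload flag) is to run llama.cpp's own CLI with GPU offload enabled and filter its startup log for Vulkan messages:

```shell
# Offload layers to the GPU (-ngl 0 would keep everything on the CPU),
# then look for Vulkan backend / device lines in the log.
./main -m tinyllama-1.1b-chat.Q4_K_M.gguf -ngl 99 -p "Hello" 2>&1 | grep -i vulkan
```

If nothing matches, the build is likely running CPU-only. Note that low GPU utilisation in a system monitor can also simply mean the model is small enough that the GPU is mostly idle between tokens.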
I tried running llama.cpp directly on my Android phone by building and running it with Termux, and it seems much faster than maid even without any acceleration. In maid, after more than 5 minutes there is still no output. Not sure what is wrong; I tried uninstalling and reinstalling maid and cleared the cache and the storage, yet I get nothing.
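For reference, a minimal sketch of the Termux comparison described above (the package names, CMake flag, and model path are assumptions based on common llama.cpp setups, not taken from this thread):

```shell
# Build llama.cpp inside Termux and run it directly on-device.
pkg install git cmake clang
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build                        # add -DGGML_VULKAN=ON for a Vulkan build
cmake --build build --config Release
./build/bin/main -m /sdcard/models/tinyllama-1.1b-chat.Q4_K_M.gguf -p "Hello"
```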
It's somewhat expected that it will be a little slower in maid compared to Termux, since maid has to process all the pre-prompt info (system prompt, previous messages, example messages, etc.) prior to actual inference.
Though 5 minutes is far too long. What model are you using?
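To illustrate why the pre-prompt processing matters: the whole assembled context must be evaluated before the first generated token appears, and evaluation time grows with prompt length. A minimal sketch (the prompt template here is hypothetical, not maid's actual format):

```python
# Hypothetical sketch of why maid's first token arrives later than a bare
# Termux run: the system prompt, history, and examples all get evaluated
# before generation starts.

def build_prompt(system: str, history: list[tuple[str, str]], user: str) -> str:
    """Assemble the full text llama.cpp must evaluate before generating."""
    lines = [system]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {user}")
    lines.append("assistant:")
    return "\n".join(lines)

bare = build_prompt("", [], "Hello")
full = build_prompt(
    "You are a helpful assistant.",
    [("user", "Hi"), ("assistant", "Hello! How can I help?")],
    "Hello",
)
# The maid-style prompt is several times longer than the bare one, and
# prompt-evaluation time scales roughly with token count.
print(len(full), len(bare))
```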
tinyllama-1.1b-chat, tinyllama-1.1b-1t-openorca, and tiny-llama-openhermes-1.1b. I also tried stablelm-zephyr-3b, but that doesn't even load with llama.cpp, so that's not a maid issue.
Ahh OK, they're small enough that they should be pretty fast. It could be that they're using an unsupported GGUF version, but if they're working with the latest llama.cpp, that shouldn't be the case.
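If a GGUF version mismatch is the suspect, the file's version can be read directly: per the GGUF spec, a file starts with the magic bytes `GGUF` followed by a little-endian uint32 version. A minimal sketch:

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF container version of a model file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # A little-endian uint32 version immediately follows the magic.
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

Newer llama.cpp builds write GGUF v3; much older v1/v2 files may need reconversion, but models that already run under the latest llama.cpp in Termux should be fine.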
Now that Vulkan GPU acceleration has been merged into llama.cpp, is it possible to implement it here? It may give a good performance boost to the local backend.