Closed · daniandtheweb closed this 3 months ago
This would be an interesting addition, but CPU backends or integrated-graphics backends might have issues. IIRC, Vulkan either doesn't support integrated graphics or has poor support for it, so an option to choose the backend should be available.
I would love to implement some form of acceleration, because llama.cpp on Android is painfully slow.
Older versions of maid produced output fairly quickly, depending on the model size, even on my phone, which is fairly old. Did something change in llama.cpp?
I don't know. Obviously, the llama.cpp integration in maid has changed a lot over the months I've been developing it. I was able to get the current implementation to run quite fast on mobile for a few of the small-parameter models (tinyllama 2b, calypso 3b, orcamini 3b), but more development is needed to get isolates working correctly so the app doesn't lag for ages when you prompt.
OK, as of 98533eaaa94e70605973faabc9428753bde88146 I've enabled Vulkan acceleration on Linux and Android (Windows is coming too). I've tested it out on Android; it seems noticeably faster, but not jaw-droppingly so.
I've just tested it, and with tinyllama it's way faster and totally usable now. However, when I check GPU and CPU usage, GPU usage stays below 15%, so is it really using Vulkan?
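One way to sanity-check whether the Vulkan backend is actually being picked up (a hedged sketch; the binary name and model filename are placeholders, and `-ngl` is llama.cpp's layer-offload flag) is to run llama.cpp's own CLI with GPU offload enabled and filter its startup log for Vulkan messages:

```shell
# Offload layers to the GPU (-ngl 0 would keep everything on the CPU),
# then look for Vulkan backend / device lines in the log.
./main -m tinyllama-1.1b-chat.Q4_K_M.gguf -ngl 99 -p "Hello" 2>&1 | grep -i vulkan
```

If nothing matches, the build is likely running CPU-only. Note that low GPU utilisation in a system monitor can also simply mean the model is small enough that the GPU is mostly idle between tokens.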
I tried running llama.cpp directly on my Android phone by building and running it with Termux, and it seems much faster than maid even without any acceleration. In maid, after more than 5 minutes there is still no output. Not sure what is wrong; I tried uninstalling and reinstalling maid and cleared the cache and the storage, yet I get nothing.
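For reference, a minimal sketch of the Termux comparison described above (the package names, CMake flag, and model path are assumptions based on common llama.cpp setups, not taken from this thread):

```shell
# Build llama.cpp inside Termux and run it directly on-device.
pkg install git cmake clang
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build                        # add -DGGML_VULKAN=ON for a Vulkan build
cmake --build build --config Release
./build/bin/main -m /sdcard/models/tinyllama-1.1b-chat.Q4_K_M.gguf -p "Hello"
```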
It's somewhat expected that it will be a little slower in maid compared to Termux, since maid has to process all the pre-prompt info (system prompt, previous messages, example messages, etc.) prior to actual inference.
Though 5 minutes is far too long. What model are you using?
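To illustrate why the pre-prompt processing matters: the whole assembled context must be evaluated before the first generated token appears, and evaluation time grows with prompt length. A minimal sketch (the prompt template here is hypothetical, not maid's actual format):

```python
# Hypothetical sketch of why maid's first token arrives later than a bare
# Termux run: the system prompt, history, and examples all get evaluated
# before generation starts.

def build_prompt(system: str, history: list[tuple[str, str]], user: str) -> str:
    """Assemble the full text llama.cpp must evaluate before generating."""
    lines = [system]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"user: {user}")
    lines.append("assistant:")
    return "\n".join(lines)

bare = build_prompt("", [], "Hello")
full = build_prompt(
    "You are a helpful assistant.",
    [("user", "Hi"), ("assistant", "Hello! How can I help?")],
    "Hello",
)
# The maid-style prompt is several times longer than the bare one, and
# prompt-evaluation time scales roughly with token count.
print(len(full), len(bare))
```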
tinyllama-1.1b-chat, tinyllama-1.1b-1t-openorca, and tiny-llama-openhermes-1.1b. I also tried stablelm-zephyr-3b, but that doesn't even load with llama.cpp, so that's not a maid issue.
Ahh OK, they're small enough that they should be pretty fast. It could be that they're using an unsupported GGUF version, but if they're working with the latest llama.cpp, that shouldn't be the case.
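If a GGUF version mismatch is the suspect, the file's version can be read directly: per the GGUF spec, a file starts with the magic bytes `GGUF` followed by a little-endian uint32 version. A minimal sketch:

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF container version of a model file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # A little-endian uint32 version immediately follows the magic.
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

Newer llama.cpp builds write GGUF v3; much older v1/v2 files may need reconversion, but models that already run under the latest llama.cpp in Termux should be fine.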
Now that Vulkan GPU acceleration has been merged into llama.cpp, is it possible to implement it here? It may give a good performance boost to the local backend.