containers / ramalama

The goal of RamaLama is to make working with AI boring.

Vision models #150

Open p5 opened 1 month ago

p5 commented 1 month ago

Value Statement

As someone who wants a boring way to use AI, I would like to expose an image/PDF/document to the LLM, so that I can make requests and extract information, all within Ramalama.

Notes

Various models now include vision functionality: they can ingest images and answer questions about them. The accuracy of these LLM-based text extractions can now exceed that of dedicated OCR tooling (even paid products like AWS Textract). The same vision models can also extract information from PDF documents fairly easily once the pages are converted to images.
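
As a rough illustration of that conversion step (this uses poppler's pdftoppm, which is entirely separate from ramalama, and the file names are just examples), each PDF page can be rasterized to an image before it is handed to a vision model:

$ pdftoppm -png -r 150 document.pdf page
$ ls
page-1.png  page-2.png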

We can use a similar interface to the planned Whisper.cpp implementation, since both are just contexts or data we provide to the LLMs. This has not been detailed anywhere, so below is a proposal/example of how it could look.

$ ramalama run --context-file ./document.pdf phi3.5-vision
>> When is this letter dated?
The date in the letter is `1st January 1999`

>> What is this document about?
This document is an instruction manual detailing how to use Ramalama, a cool new way to run LLMs (Large Language Models) across Linux and MacOS.  It supports text and vision-based models.

$ ramalama run --context-file ./painting.png phi3.5-vision
>> What is in the painting?
This is an abstract oil painting about something and something else.  It seems to be inspired by some artist.
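
For what it's worth, one way --context-file could plausibly work under the hood is to forward the image as a base64 data URL in an OpenAI-style chat completion, which vision-capable servers (vllm, for example) already accept. This is only a sketch: the port, model name, and file are placeholders, and base64 -w0 is the GNU flag for unwrapped output.

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "phi3.5-vision",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "When is this letter dated?"},
              {"type": "image_url",
               "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 page-1.png)"'"}}
            ]
          }]
        }'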

The primary issue is that neither ollama nor llama.cpp supports vision models at the moment, so this would either need a custom implementation or require adding something like vllm.
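
For reference, a recent vllm can already serve some vision models behind an OpenAI-compatible endpoint, so the ramalama side would mostly be plumbing. A plausible (but unverified here) invocation, with the model name purely as an example:

$ vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096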

ericcurtin commented 1 month ago

We had intended to merge vllm support soon; we started it here:

https://github.com/containers/ramalama/pull/97

That PR is an outline of what we think it should look like. Basically we want to introduce a --runtime flag, kind of like the podman one that switches between crun, runc, and krun, but in this case it lets you switch between llama.cpp, vllm, and whatever other runtimes people would like to integrate in the future.
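
To make that concrete, a hypothetical invocation could look something like this (the flag name and values are just the proposal, not an implemented interface, and the model name is only an example):

$ ramalama --runtime llama.cpp run tinyllama
$ ramalama --runtime vllm run tinyllama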

The above is a key feature we want; it's one of the reasons we don't simply use Ollama.

Now that we have vllm v0.6.1, we are ready to complete that work.

Vision models like this would be useful for sure.

Personally I'm gonna be out a little bit in the next week or two; I have a wedding and other things I need to take some time for.

Anybody who wants to pick up --runtime, vllm support, vision model support, like you @p5 or others, be my guest.

ericcurtin commented 1 month ago

@rhatdan merged the first vllm-related PR. I dunno if you want to take a stab at implementing the other things you had in mind, @p5

rhatdan commented 3 weeks ago

@p5 still interested in this?

p5 commented 3 weeks ago

Hey Dan, Eric

My free time is very limited at the minute. Starting a new job in 2 weeks and there's a lot to get in order.

I still feel vision models would be a great addition to ramalama, but I'm going to be in a Windows-only environment :sigh: so I'm unsure how much I'll be able to help out.

rhatdan commented 3 weeks ago

Thanks @p5, good luck with the new job.

ericcurtin commented 3 weeks ago

Best of luck @p5. @bmahabirbu did have success running on Windows recently:

https://github.com/containers/ramalama/tree/main/docs/readme

p5 commented 2 weeks ago

FYI - Ollama is now implementing vision models, so once v0.4 is released, it might be easier to integrate here.

ericcurtin commented 2 weeks ago

> FYI - Ollama is now implementing vision models, so once v0.4 is released, it might be easier to integrate here.

Indirectly, maybe. We inherit from the same backend, llama.cpp; we don't actually use any Ollama stuff directly, even though to a user it might appear that way!

p5 commented 2 weeks ago

Oh, apologies. I thought Ramalama used both llama.cpp and ollama runtimes 🤦 Now I can see you use Ollama's registry and transport, served via llama.cpp runtime.

ericcurtin commented 2 weeks ago

And we wrote the Ollama transport from scratch, so we use zero Ollama code.
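
(For anyone following along: the Ollama registry is just one of the transports ramalama's own code speaks. If I have the prefixes right, pulling from it looks something like this, with the model name only as an example:)

$ ramalama pull ollama://tinyllama
$ ramalama run tinyllama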

What a lot of people don't realize is it's llama.cpp that does most of the heavy lifting for Ollama.