eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License
1.57k stars, 88 forks

Parse structured data from an Image #301

Open devtanna opened 3 weeks ago

devtanna commented 3 weeks ago

Hello 👋

Now that many models support image input as part of the prompt, what do you think of kor having support for parsing data from images? I would love to try and put up a draft PR :)

The typical use case: a user uploads a PDF invoice, it's converted to an image, and the image is passed to kor for data extraction. Currently, the PDF is converted to text first and that text is passed to kor. The image flow is especially useful when the document has handwritten parts, which text conversion tends to lose.

eyurtsev commented 1 week ago

@devtanna I'd be happy to review, but I'm not sure that kor-style input is helpful here, since kor mostly helps with defining reference examples when the inputs are text.

I would suggest trying LangChain with a prompt that contains an image and using a pydantic model as a tool (if there are multimodal models that also support function calling). What do you think?
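
For illustration, here is a minimal sketch of that suggestion, not something kor provides. It assumes the `langchain-openai` and `pydantic` packages, an `OPENAI_API_KEY` in the environment, and a multimodal model that supports function calling (`gpt-4o` is only a placeholder). The `Invoice` schema, its fields, and both helper names are hypothetical.

```python
import base64


def image_message_content(image_bytes: bytes, instruction: str,
                          mime: str = "image/png") -> list:
    """Build OpenAI-style multimodal content: a text part plus an inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": instruction},
        {"type": "image_url",
         "image_url": {"url": f"data:{mime};base64,{encoded}"}},
    ]


def extract_invoice(image_bytes: bytes, model_name: str = "gpt-4o"):
    """Send an invoice image to a multimodal model and parse the reply.

    Third-party imports are deferred so the helper above can be used
    without them. Calling this function needs network access and an API key.
    """
    from typing import Optional

    from langchain_core.messages import HumanMessage
    from langchain_openai import ChatOpenAI
    from pydantic import BaseModel, Field

    class Invoice(BaseModel):
        """Hypothetical schema; adjust the fields to your documents."""
        vendor: Optional[str] = Field(None, description="Name of the vendor")
        total: Optional[float] = Field(None, description="Total amount due")
        handwritten_notes: Optional[str] = Field(
            None, description="Any handwritten text on the invoice")

    # Assumption: the chosen model accepts images and supports tool calling.
    llm = ChatOpenAI(model=model_name, temperature=0)
    # Expose the pydantic model to the LLM as a tool and parse the tool call.
    structured_llm = llm.with_structured_output(Invoice, method="function_calling")

    content = image_message_content(
        image_bytes, "Extract the invoice fields from this image.")
    return structured_llm.invoke([HumanMessage(content=content)])
```

Usage would look like `extract_invoice(open("invoice.png", "rb").read())`, where `invoice.png` is a page image rendered from the PDF beforehand. How well this handles handwritten parts depends entirely on the model chosen.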