Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License
26.96k stars 2.71k forks source link

[FEAT]: Direct conversion of pdf into png for vision LLM #2348

Open nekopep opened 1 month ago

nekopep commented 1 month ago

What would you like to see?

When I work with anything LLM each time I upload a doc it is automatically embedded by anythingLLM into the workspace. From my experience I get low quality result working with direct embedding. Thanks to your last commit to support MistralAi vision, I experimented using the same pdf and instead I did this:

pdftoppm -png -r 100 airmontenegro.pdf > airmontenegro_100.png

optipng -fix airmontenegro_100.png

and then I directly upload the image to the model and work with it. I found it more accurate.

Would it be possible to work directly with pdf as image for vision LLM or add something in UI to allow this?

This is related to a more global issue I have with working with anythingLLM and pdf docs. I think (and probbaly I'm compeltely wrong) that anythingLLM is thought as a chat interface able to ingest a ton of documents and work on this mass of data. My use case is more basic, usually I want to work on only one pdf and in this case I found the workspace UI difficult to use to get the result I want. Only today I get the result I wanted with this "pdf to image" trick.

This is a more general "last feature" missing to anythingLLM when users come from chatGPT and are used to make it ingest PDF. AnythingLLM will gently absorb the pdf and add it to the workspace BUT add it to all other PDF currently in the workspace (generally not a thing basic users wants). Even a /reset keep the docs, so if you work on different pdf iteration the workspace is messed up with all the pdf uploaded (because the /reset do not reset the file uploaded).

My feature request can be used for user to experiment pdf to image for vision LLM, allow basic user to work like in chatGPT and perhaps we could discuss in another ticket how to fix the more general usabilty issue I get with working with one shot pdf?

Example with pdf embedding: image Result: image

Example with image conversion and direct upload: image (Much) Better result: image

Thank you for any feedback

nekopep commented 1 month ago

Well, reviewing the pdf, perhaps my pdf was not the more suitable for the comparison, the vision LLM has a net advantage since it can analyze the data position to better understand the pdf.

Still, It make this feature request even more interesting ;)

nekopep commented 1 month ago

2301 seems related to my feedback on anythingLLM usage for direct PDF analysis