Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[FEAT]: Chunking of PDFs #592

Open sumitsodhi88 opened 10 months ago

sumitsodhi88 commented 10 months ago

What would you like to see?

An uploaded PDF has some issues:

  1. A PDF produces more chunks than the equivalent txt file, probably because of the PDF's formatting: the extracted text has more newline characters (\n) and the lines are not continuous, which makes the text larger and creates more chunks.
  2. Superscripts, headers, and footers in the PDF are also included, which confuses the LLM a lot, for example when I ask "tell me about section 2". I know a lot of cleaning has to be done before uploading a file, but I can't expect all managers to do that; in the end they will only complain that the LLM is not good.

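
For the header/footer problem in item 2, a rough pre-processing sketch (not something AnythingLLM is known to do) could look like the following. It assumes the PDF text has already been extracted one string per page; the function names `strip_repeated_margins` and `_normalize` and the `min_share` parameter are hypothetical, made up for illustration.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    # Mask digits so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_margins(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop a page's first/last line when it recurs on most pages.

    Assumption: headers and footers are the first or last non-empty
    line of a page and repeat (apart from digits) across pages.
    """
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    counts: Counter = Counter()
    for lines in split_pages:
        # A set, so a one-line page is only counted once.
        for key in {_normalize(ln) for ln in (lines[:1] + lines[-1:])}:
            counts[key] += 1

    # Require at least two occurrences so short documents are left alone.
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and _normalize(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _normalize(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

This only looks at the outermost line of each page, so multi-line headers or superscripts inside the body would need a different heuristic.
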
sumitsodhi88 commented 10 months ago

It seems properly formatted Word files are working fine.

It seems the newline characters (\n) in the PDFs prevent the context from being identified correctly.
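
As a possible workaround for the stray line breaks, the extracted text could be unwrapped before it is chunked. This is a minimal sketch under the assumption that blank lines mark real paragraph breaks and single newlines are only layout; `unwrap_lines` is a hypothetical helper, not part of AnythingLLM.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs, keeping blank-line breaks."""
    text = text.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "calen-\ndar" -> "calendar".
    # Caveat: this also merges genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines are treated as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every remaining newline or run of spaces into a single space.
    flat = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in flat if p)
```

For example, `unwrap_lines("Section 2\napplies to travel.")` gives a single continuous line, so a phrase like "Section 2 applies to travel." is no longer split by a line break.
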

Employee9833 commented 10 months ago

Hi, I also have the same issue: PDFs are not being parsed correctly.

jainpradeep commented 6 months ago

Did anyone solve this issue?