danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Improve PDF text extraction #1938

Open jeremi opened 1 month ago

jeremi commented 1 month ago

PDF text extraction has improved considerably in recent projects. It would be good to integrate a pipeline like the ones used by https://github.com/VikParuchuri/marker and https://github.com/VikParuchuri/surya, which would give better extracted text. I understand that the licences of those projects may make them unsuitable, but there might be similar models with more open licences.

jeremi commented 1 month ago

I also discovered this project, which could be an option: https://github.com/opendatalab/MinerU

emerzon commented 1 month ago

We've also been looking forward to using cloud-based services such as https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence or https://aws.amazon.com/textract/