When crawling the net, parse pdf documents as well

Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.

https://agpt.co

MIT License


When crawling the net, parse pdf documents as well #514

Closed sanesanyo closed 1 year ago

sanesanyo commented 1 year ago

Duplicates

[X] I have searched the existing issues

Summary 💡

When crawling the web to do market research, many of the links turn out to be PDF documents. It would be great if Auto GPT had a built-in ability to parse those PDFs and feed the text to GPT-4 for analysis.
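
As a rough illustration of the first step, the crawler would need to notice that a fetched link is a PDF rather than an HTML page. The sketch below is only an assumption about how that check could look; `looks_like_pdf` is a hypothetical helper, not an existing Auto GPT function. It checks the `Content-Type` header first and falls back to the `%PDF-` marker that PDF files begin with.

```python
from typing import Optional

# PDF files begin with this marker, followed by the version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type: Optional[str], body_prefix: bytes) -> bool:
    """Return True if a fetched response appears to be a PDF document.

    Hypothetical helper: `content_type` is the raw Content-Type header
    (or None), `body_prefix` is the first chunk of the response body.
    """
    if content_type:
        # Drop parameters such as "; charset=binary" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Some servers send a generic content type, so also look for the
    # marker near the start of the body.
    return PDF_MAGIC in body_prefix[:1024]
```

A response that passes this check could be routed to a PDF text extractor instead of the HTML scraper.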

Examples 🌈

Research on investing in Emerging Markets in 2023 --> The first few hits on Google Search are PDF documents. Auto GPT fails to parse them.

Motivation 🔦

This way Auto GPT can do the market research task far better than it currently can.

Boostrix commented 1 year ago

This would probably be a plugin that uses a Python PDF parsing library analogous to pdf2text. (I'm not sure how to mark/label the issue, or whether I lack the permissions to do so.)
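
A minimal sketch of what such a plugin's core could look like, assuming the third-party `pypdf` package is used as the parsing library (the thread does not settle on one, so this is a swapped-in choice). `extract_pdf_text` is a hypothetical name, and the sketch only handles text-based PDFs; scanned documents would need OCR.

```python
import io
from itertools import islice
from typing import Optional


def extract_pdf_text(data: bytes, max_pages: Optional[int] = None) -> str:
    """Extract plain text from PDF bytes (hypothetical helper).

    Assumes the third-party `pypdf` package is installed. Only the
    first `max_pages` pages are read when a limit is given.
    """
    if not data.lstrip().startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF document")
    try:
        from pypdf import PdfReader  # third-party dependency (assumption)
    except ImportError as exc:
        raise RuntimeError("pypdf is required for PDF parsing") from exc

    reader = PdfReader(io.BytesIO(data))
    texts = []
    for page in islice(reader.pages, max_pages):
        # extract_text() can return an empty result for image-only pages.
        texts.append(page.extract_text() or "")
    return "\n\n".join(texts)
```

Capping the page count keeps long reports from overflowing the model's context window; the returned text would still need to be chunked or summarised before being handed to the model.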

anonhostpi commented 1 year ago

Agreed with @Boostrix on this one. PDF parsing is an extraneous task, and isn't as straightforward as it ought to be. It would be better to assign that to developers who are skilled in PDF parsing.

Boostrix commented 1 year ago

There is already PR #3031, which supports plain-text-based PDF processing.

That would also make it possible to support arguments, such as searching a PDF file by author, date, or page, which would return a list of matching pages.
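
To make the "list of pages/matches" idea concrete, here is a small sketch that searches already-extracted page texts for a query string. It covers only the text-search case, not author or date metadata, and `PageMatch` and `search_pages` are hypothetical names rather than anything from PR #3031.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageMatch:
    page_number: int  # 1-based page index
    snippet: str      # text surrounding the first hit on that page


def search_pages(pages: List[str], query: str, context: int = 40) -> List[PageMatch]:
    """Return one match per page whose text contains `query` (case-insensitive)."""
    needle = query.lower()
    matches: List[PageMatch] = []
    if not needle:
        return matches
    for number, text in enumerate(pages, start=1):
        pos = text.lower().find(needle)
        if pos == -1:
            continue
        # Keep a little surrounding text so the caller can judge relevance.
        start = max(0, pos - context)
        end = min(len(text), pos + len(needle) + context)
        matches.append(PageMatch(number, text[start:end].strip()))
    return matches
```

Returning page numbers with short snippets, rather than full page text, would let the agent decide which pages are worth reading in full.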

A higher-level command would probably be an adaptation of browse_website, or a command that searches specifically for PDF files using different search engines/APIs (think research servers, as per #826), as per: https://github.com/Significant-Gravitas/Auto-GPT/issues/503#issuecomment-1534094916
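
One way the PDF-specific search could be approached, sketched under assumptions: restrict the query with the `filetype:pdf` operator, which Google and several other search engines accept, and then keep only result URLs whose path ends in `.pdf`. Both function names are hypothetical, and the search call itself is left out because the thread does not name an API.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def pdf_only_query(query: str) -> str:
    """Append the filetype operator so the engine returns PDF results."""
    return f"{query} filetype:pdf"


def filter_pdf_links(urls: Iterable[str]) -> List[str]:
    """Keep URLs whose path ends in .pdf, ignoring query strings and fragments."""
    return [u for u in urls if urlparse(u).path.lower().endswith(".pdf")]
```

The URL filter is only a heuristic, since some PDF links carry no `.pdf` suffix; checking the response's content type after fetching would catch those.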

Probably covered by #2730

Plugin candidate, once the dust settles with #3652

github-actions[bot] commented 1 year ago

This issue was closed automatically because it has been stale for 10 days with no activity.