Request to support PDF scraping

BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL

https://www.builder.io/blog/custom-gpt

ISC License

Request to support PDF scraping #95

Open Zenpenguin opened 7 months ago

Zenpenguin commented 7 months ago

Hi, thank you for this amazing repo. I am trying to use it on a website that also hosts hundreds of PDFs. The crawler is unable to get the content from the PDFs. It fails with the error:

PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_ABORTED

It would be great if support for crawling PDFs could be added as well.

Gorchakov-Pressure commented 6 months ago

How can I skip files that the crawler comes across while parsing?

isarikaya commented 6 months ago

How can I skip files that the crawler comes across while parsing?

You must list the extensions you want to exclude under the resourceExclusions option in the config.ts file, for example resourceExclusions: ["pdf"] (it is empty by default: resourceExclusions: []).
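As a rough illustration of the suggestion above, here is a minimal sketch of a config that lists extensions to skip. Only the resourceExclusions option and the config.ts file name come from this thread. The CrawlConfigSketch type, the example URLs, and the isExcluded helper are hypothetical stand-ins added for illustration. How the crawler applies the list internally is not shown here, so treat the helper as one plausible interpretation (matching on file extension) rather than the project's actual logic.

```typescript
// Hypothetical, trimmed-down stand-in for the project's config type.
// Only `resourceExclusions` is taken from the thread; the other fields
// are illustrative assumptions.
interface CrawlConfigSketch {
  url: string;
  match: string;
  outputFileName: string;
  resourceExclusions?: string[];
}

const config: CrawlConfigSketch = {
  url: "https://example.com/docs", // placeholder URL
  match: "https://example.com/docs/**", // placeholder pattern
  outputFileName: "output.json",
  // Extensions to skip, written without the leading dot (assumed format).
  resourceExclusions: ["pdf", "zip", "png", "jpg"],
};

// Illustrative helper (not part of gpt-crawler): returns true when the URL's
// path ends in one of the excluded extensions, ignoring query string and hash.
function isExcluded(url: string, exclusions: string[] = []): boolean {
  const path = url.split(/[?#]/)[0].toLowerCase();
  return exclusions.some((ext) => path.endsWith("." + ext.toLowerCase()));
}

const candidates = [
  "https://example.com/docs/intro",
  "https://example.com/files/manual.PDF?download=1",
];
const toCrawl = candidates.filter(
  (u) => !isExcluded(u, config.resourceExclusions),
);
console.log(toCrawl);
```

With a list like this, PDF links would be skipped instead of failing with net::ERR_ABORTED. Skipping does not extract their content, so the original request for PDF support remains open.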