apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Crawlers should have an option to respect robots.txt #229

Open jakubbalada opened 5 years ago

jakubbalada commented 5 years ago

Currently you have to write your own function to parse and respect a target website's robots.txt file. A common function for that in the SDK (probably in utils.js) would be great.

LeMoussel commented 5 years ago

I propose using the Robots Parser library, with common functions like these in utils.js:
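As a rough, dependency-free sketch of what such a helper might look like (the actual proposal is to use the robots-parser npm package, which additionally handles Allow rules, wildcards, and Crawl-delay), assuming simple prefix matching of Disallow rules for the matching user agent:

```javascript
// Minimal sketch of a robots.txt check, NOT a full implementation.
// A real helper would use a library such as robots-parser and would
// also fetch and cache the robots.txt file per hostname.
function parseRobotsTxt(content, userAgent = '*') {
  const disallows = [];
  let applies = false;
  for (const rawLine of content.split('\n')) {
    // Strip comments and surrounding whitespace.
    const line = rawLine.split('#')[0].trim();
    const [field, ...rest] = line.split(':');
    if (!field) continue;
    const value = rest.join(':').trim();
    const key = field.trim().toLowerCase();
    if (key === 'user-agent') {
      // Rules that follow apply if the group matches our user agent.
      applies = value === '*' || value.toLowerCase() === userAgent.toLowerCase();
    } else if (key === 'disallow' && applies && value) {
      disallows.push(value);
    }
  }
  return {
    // A URL is allowed if its path matches no Disallow prefix.
    isAllowed(url) {
      const path = new URL(url).pathname;
      return !disallows.some((prefix) => path.startsWith(prefix));
    },
  };
}
```

A production version would need to handle Allow directives, `$` and `*` wildcards, and multiple user-agent groups, which is exactly why delegating to an existing parser library makes sense.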

mgifford commented 2 years ago

Ok, so what is the status of this? It's not clear where this has been addressed (or why it hasn't yet).

mnmkng commented 2 years ago

It's not implemented. There haven't been many users requesting it, and it's easy enough for the users who need it to implement themselves. We might add it in the future, but there's no timeline.

mgifford commented 2 years ago

Is this documented somewhere? I'm not interested in handcrafting a rule per URL, but rather in reading the robots.txt file and extracting which paths to skip, so as to respect the site owner's directions.

mnmkng commented 2 years ago

See the comment above by LeMoussel.
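For completeness, once the robots.txt rules are parsed, wiring the check into a Crawlee crawler can be done by filtering requests before they are enqueued. A hedged sketch, assuming `enqueueLinks`'s `transformRequestFunction` option (which drops a request when a falsy value is returned) and a hypothetical `isAllowed` callback supplied by whatever robots.txt parser is used:

```javascript
// Sketch: build a transformRequestFunction that skips disallowed URLs.
// `isAllowed` is assumed to come from a robots.txt parser (e.g. the
// robots-parser package proposed earlier in this thread).
function makeRobotsTransform(isAllowed) {
  // Returning a falsy value tells enqueueLinks to drop the request;
  // otherwise the request options are passed through unchanged.
  return (requestOptions) =>
    isAllowed(requestOptions.url) ? requestOptions : false;
}

// Assumed usage inside a Crawlee requestHandler:
// await enqueueLinks({
//   transformRequestFunction: makeRobotsTransform((url) => robots.isAllowed(url)),
// });
```

This keeps the robots.txt logic in one place while reusing Crawlee's existing request pipeline, which is the kind of user-side implementation the maintainers describe above.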