danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Respect robots.txt / robots meta tag when crawling websites #1058

Open fabiangebert opened 4 months ago

fabiangebert commented 4 months ago

I'd suggest implementing some functionality to make the web crawler respect the index / disallow settings defined in the robots.txt file or robots meta tags of the website being crawled.

See https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag

That would make it a lot safer to include external websites.
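A minimal sketch of honoring the robots meta tag could look like the following, using only the Python standard library. The class and function names here are hypothetical illustrations, not part of Danswer's codebase:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",")
            )


def is_indexable(html: str) -> bool:
    """Return False if the page opts out of indexing via 'noindex' or 'none'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)
```

The crawler would call `is_indexable` on each fetched page and skip storing pages that opt out.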

demarant commented 4 months ago

Besides the meta tag, the web crawler should also check the robots.txt at the root of the site. There seems to be no robots.txt check in the web connector (https://github.com/danswer-ai/danswer/blob/main/backend/danswer/connectors/web/connector.py#L52); we need to add it there.
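One way to add such a check, sketched with the stdlib `urllib.robotparser` (the function names and the "DanswerBot" user agent string are assumptions for illustration, not Danswer's actual API):

```python
from urllib import robotparser


def build_robot_parser(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse raw robots.txt content (already fetched from /robots.txt)."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def is_crawl_allowed(
    rp: robotparser.RobotFileParser,
    url: str,
    user_agent: str = "DanswerBot",  # hypothetical user agent name
) -> bool:
    """Check whether robots.txt permits this user agent to fetch the URL."""
    return rp.can_fetch(user_agent, url)
```

In the connector, the crawler would fetch `<site root>/robots.txt` once before crawling, build the parser, and call `is_crawl_allowed` on each candidate URL before visiting it.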

demarant commented 1 month ago

@Weves Robots.txt is now respected via PR #1538