fabiangebert opened this issue 4 months ago
Besides the meta tag, the web crawler should also check the robots.txt file in the root of the site. There currently seems to be no robots.txt check in the web connector (https://github.com/danswer-ai/danswer/blob/main/backend/danswer/connectors/web/connector.py#L52), so we need to add it there.
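
For reference, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. The function and parameter names are just illustrative, not the connector's actual API:

```python
# Minimal sketch of a robots.txt check; names are illustrative only.
from urllib import robotparser
from urllib.parse import urljoin, urlparse


def get_robots_parser(base_url: str) -> robotparser.RobotFileParser:
    """Fetch and parse robots.txt from the site root."""
    parsed = urlparse(base_url)
    root = f"{parsed.scheme}://{parsed.netloc}"
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()  # if robots.txt is missing, the parser allows everything
    return parser


def is_allowed(parser: robotparser.RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the crawler may fetch this URL per robots.txt."""
    return parser.can_fetch(user_agent, url)
```

The crawler could build one parser per site and skip any URL for which `is_allowed` returns False before fetching it.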
@Weves Robots.txt is now respected via PR #1538
I'd suggest implementing functionality to make the web crawler respect the index/disallow settings defined in the robots.txt file or robots meta tags of the website being crawled.
See https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag
That would make it a lot safer to include external websites.
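
For the robots meta tag part, a hedged sketch of checking an already-fetched page is below. It assumes the HTML is parsed with BeautifulSoup, which may differ from how the connector actually processes pages:

```python
# Sketch of a robots meta tag check on fetched HTML; assumes BeautifulSoup.
from bs4 import BeautifulSoup


def robots_meta_directives(html: str) -> set[str]:
    """Collect directives from <meta name="robots" content="..."> tags."""
    soup = BeautifulSoup(html, "html.parser")
    directives: set[str] = set()
    for tag in soup.find_all("meta", attrs={"name": "robots"}):
        content = tag.get("content", "")
        directives.update(d.strip().lower() for d in content.split(","))
    return directives


def should_index(html: str) -> bool:
    """Skip pages that declare noindex via a robots meta tag."""
    return "noindex" not in robots_meta_directives(html)
```

A `nofollow` directive could additionally be used to stop extracting further links from that page.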