UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl
MIT License

Robots.txt #1509

Open JeltevanBoheemen opened 3 months ago

JeltevanBoheemen commented 3 months ago

Is your feature request related to a problem? Please describe. Because I-Analyzer no longer requires a login, the application is vulnerable to crawling.

Describe the solution you'd like Provide a robots.txt with some sensible defaults. What these should be is still to be decided. @ar-jan seems to have some ideas about this?

ar-jan commented 3 months ago

Since our main concern is performance issues due to crawling, I think it's best to just disallow /search/ and keep it at that.

I think it's fine if no search results pages are included in search engine indexes at all, since they are dynamic: there's no guarantee that a given search phrase will still appear on a particular page of results. If there's a benefit to having people find the website through the content of the corpora, we could include a sitemap with direct links to individual documents, but that would be a huge list.
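If we ever did go the sitemap route, generating one from a list of document links is straightforward with the standard library. A minimal sketch, assuming a hypothetical /document/<corpus>/<id> URL pattern (not the actual I-analyzer routing):

```python
# Sketch: build a minimal sitemap.xml for direct document links.
# The /document/<corpus>/<id> URL pattern below is hypothetical.
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(doc_urls):
    # <urlset> root with one <url><loc>…</loc></url> entry per document
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in doc_urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")

docs = [
    "https://ianalyzer.hum.uu.nl/document/times/123",
    "https://ianalyzer.hum.uu.nl/document/times/124",
]
sitemap_xml = build_sitemap(docs)
print(sitemap_xml)
```

For a large corpus this list would indeed be huge, so it would need to be generated on the fly or split into a sitemap index rather than maintained by hand.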

So the simple solution would just be:

User-agent: *
Disallow: /search/