UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl
MIT License

Robots.txt #1509

Open JeltevanBoheemen opened 3 months ago

JeltevanBoheemen commented 3 months ago

Is your feature request related to a problem? Please describe. Because I-Analyzer no longer requires a login, the application is vulnerable to crawling.

Describe the solution you'd like Provide a robots.txt with some sensible defaults. What these should be is still to be decided. @ar-jan seems to have some ideas about this?

ar-jan commented 3 months ago

Since our main concern is performance issues due to crawling, I think it's best to just disallow /search/ and keep it at that.

I think it's fine if no search results pages are included in search engine indexes at all, since they are dynamic: there's no guarantee that a given search phrase will still appear on a particular page of results. If there's a benefit to having people find the website through the content of the corpora, we could include a sitemap with direct links to individual documents, but that would be a huge list.
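If we ever did go the sitemap route, generating one from a list of document links is straightforward with the standard library. A minimal sketch, assuming a hypothetical /document/<corpus>/<id> URL pattern (not the actual I-analyzer routing):

```python
# Sketch: build a minimal sitemap.xml for direct document links.
# The /document/<corpus>/<id> URL pattern below is hypothetical.
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(doc_urls):
    # <urlset> root with one <url><loc>…</loc></url> entry per document
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in doc_urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")

docs = [
    "https://ianalyzer.hum.uu.nl/document/times/123",
    "https://ianalyzer.hum.uu.nl/document/times/124",
]
sitemap_xml = build_sitemap(docs)
print(sitemap_xml)
```

For a large corpus this list would indeed be huge, so it would need to be generated on the fly or split into a sitemap index rather than maintained by hand.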

So the simple solution would just be:

User-agent: *
Disallow: /search/