PrivateBin / Directory

Rust-based directory application that collects a list of federated instances
https://privatebin.info/directory/

Adhere to robots.txt to disallow directory listing #3

Closed rugk closed 4 years ago

rugk commented 4 years ago

I know that currently anyone can add any site to the wiki, but for this new project, should we require some authentication or permission from the site owner before an instance appears in the directory?

I can imagine that some people may not want their instance to be listed publicly (to avoid the extra load or use, because it is a private instance, etc.). Since this site now makes it very easy to add instances, maybe we do not want that?

Implementation

elrido commented 4 years ago

This is already implemented, and some servers have already opted out. If the server responds to the check with anything other than 200 or 3xx, we won't add it. Some servers work in the browser, but respond with 403 Forbidden as soon as they see the string "bot" in the user agent. And I purposefully chose to use:

https://github.com/PrivateBin/Directory/blob/f30ebe186af86d845d93254faaa8defd0d379e51/src/models.rs#L229
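As a rough illustration of the status rule described above (this is not the code from the linked file; the function name `may_be_listed` is made up for this sketch), the decision could look something like:

```rust
/// Hypothetical helper: decides whether an instance may be listed,
/// based only on the HTTP status code returned to the directory's check.
fn may_be_listed(status: u16) -> bool {
    // 200 OK or any 3xx redirect counts as acceptable; every other status
    // (for example a 403 sent to opt out) means the instance is not added.
    status == 200 || (300..400).contains(&status)
}
```

Under these assumptions, `may_be_listed(301)` is true, while `may_be_listed(403)` is false, so a server that blocks the bot's user agent stays out of the directory.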

I have already planned a task to create an info page that documents this and is linked in the above user agent header. This is common practice for bots, as admins may stumble on the odd access pattern of the bot in their logs and then can easily follow the provided link, as the user agent string is usually part of the logs.

Edit: To clarify - such a page should include example configurations showing how to opt out of the directory for Apache and nginx - something like:

# nginx example
if ($http_user_agent ~ PrivateBinDirectoryBot ) {
   return 403;
}
# apache example
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} PrivateBinDirectoryBot [NC]
RewriteRule . - [R=403,L]

rugk commented 4 years ago

Okay, good idea. Best practice would then be to also check and adhere to robots.txt… Okay, that disallows everything by default, but maybe just check whether there is an explicit "Disallow" for PrivateBinDirectoryBot?

elrido commented 4 years ago

Good idea: that is easy to add or edit even on older versions and requires no new API. The web server mechanism above will remain an option; alternatively, admins can add a section to robots.txt:

User-agent: PrivateBinDirectoryBot
Disallow: /
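A minimal sketch of how such an explicit opt-out might be detected, assuming the directory fetches robots.txt as text and only looks for a group that names the bot and contains `Disallow: /`. The function name `explicitly_opted_out` is hypothetical, and the parsing is deliberately simplified compared to a full robots.txt parser:

```rust
/// Name the bot announces itself with (taken from the thread above).
const BOT_NAME: &str = "PrivateBinDirectoryBot";

/// Hypothetical helper: returns true if `robots_txt` contains a group that
/// names the bot explicitly and disallows the whole site ("Disallow: /").
/// Groups for other agents, including the wildcard "*", are ignored.
fn explicitly_opted_out(robots_txt: &str) -> bool {
    let mut group_applies = false; // does the current group name our bot?
    let mut in_agent_lines = false; // are we still reading User-agent lines?

    for raw in robots_txt.lines() {
        // Drop comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let Some((field, value)) = line.split_once(':') else {
            continue; // blank or malformed line
        };
        let (field, value) = (field.trim().to_ascii_lowercase(), value.trim());

        match field.as_str() {
            "user-agent" => {
                // A User-agent line that follows a rule starts a new group.
                if !in_agent_lines {
                    group_applies = false;
                }
                in_agent_lines = true;
                if value.eq_ignore_ascii_case(BOT_NAME) {
                    group_applies = true;
                }
            }
            "disallow" => {
                in_agent_lines = false;
                if group_applies && value == "/" {
                    return true;
                }
            }
            // Simplification: any other field just ends the User-agent run.
            _ => in_agent_lines = false,
        }
    }
    false
}
```

With this approach a catch-all `User-agent: *` group that disallows everything is not treated as an opt-out; only a group that names PrivateBinDirectoryBot is, which matches the "explicit disallow" idea discussed above.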