PrivateBin / Directory

Rust-based directory application to collect a list of federated instances
https://privatebin.info/directory/

Take changes of the robots.txt into consideration #15

Closed by elrido 4 years ago

elrido commented 4 years ago

So we currently have one instance in the directory whose robots.txt announces (as per #3) that it doesn't want to be included in the directory. In the cron log this causes a message like this:

Instance https://example.com failed to be checked with error: Web server on URL https://example.com doesn't want to get added to the directory.
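For reference, the #3 opt-out works through the instance's robots.txt. Below is a minimal sketch of such a check in Rust; the user-agent token PrivateBinDirectoryBot and the parsing rules are assumptions for illustration and may differ from what the directory actually implements.

```rust
// Minimal sketch of the #3 opt-out check against a fetched robots.txt body.
// The user-agent token "PrivateBinDirectoryBot" and the parsing rules are
// assumptions for illustration, not necessarily what this crate does.
fn is_listing_denied(robots_txt: &str, bot_user_agent: &str) -> bool {
    let mut group_applies = false;
    for line in robots_txt.lines() {
        let line = line.trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            group_applies = agent.trim().eq_ignore_ascii_case(bot_user_agent);
        } else if group_applies {
            if let Some(path) = line.strip_prefix("Disallow:") {
                if path.trim() == "/" {
                    return true;
                }
            }
        }
    }
    false
}

fn main() {
    // An instance that does not want to be listed would serve something like:
    let robots = "User-agent: PrivateBinDirectoryBot\nDisallow: /\n";
    assert!(is_listing_denied(robots, "PrivateBinDirectoryBot"));
    println!("robots.txt denies listing in the directory");
}
```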

When investigating why this got added in the first place, I started a local instance using make run and tried to add it; it got denied and refused to be added. The cron only detects this case later on, it doesn't remove the instance. This obviously should be improved, as it is entirely possible that an admin changes their mind and later wants their instance removed. When the cron detects this case, it should trigger an immediate deletion, roughly as in the sketch below.
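To illustrate the proposed change only: delete an instance as soon as its robots.txt opts out, instead of merely logging the failure. All names below (Instance, CheckResult, cron_pass) are placeholders and not taken from this repository's actual code.

```rust
// Hypothetical sketch of the proposed cron behaviour; names are placeholders.
struct Instance {
    url: String,
}

enum CheckResult {
    Ok,
    OptedOut,       // robots.txt denies inclusion in the directory
    Failed(String), // any other check error
}

fn check(_instance: &Instance) -> CheckResult {
    // Placeholder; the real cron performs HTTP checks against the instance.
    CheckResult::OptedOut
}

fn cron_pass(instances: Vec<Instance>) -> Vec<Instance> {
    instances
        .into_iter()
        .filter(|instance| match check(instance) {
            CheckResult::OptedOut => {
                // Proposed change: remove the entry right away.
                println!("removing {}: robots.txt denies directory listing", instance.url);
                false
            }
            CheckResult::Failed(err) => {
                println!("check of {} failed: {}", instance.url, err);
                true // keep it listed, it may just be a transient error
            }
            CheckResult::Ok => true,
        })
        .collect()
}

fn main() {
    let kept = cron_pass(vec![Instance { url: "https://example.com".into() }]);
    println!("{} instance(s) remain listed", kept.len());
}
```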


I had manually removed the instance in question from the database and noticed that a few days later it had reappeared in the log. So someone adds it again, but then why does it initially pass the check on the server, yet get denied later on and on my local test instance? A few more local trials eventually let me reproduce the problem every few requests. The site uses Cloudflare, and my guess is that some requests return a version of the robots.txt that allows inclusion while others don't, so perhaps some Cloudflare caches still serve the original version.
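One way to check that hypothesis is to fetch the robots.txt repeatedly and compare the responses. A minimal sketch follows, using the reqwest crate with its blocking feature; the crate choice is mine for this sketch, not necessarily what the directory itself uses.

```rust
// Fetch the instance's robots.txt several times and report whether the
// opt-out rule shows up each time, to spot incoherent cache hits.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/robots.txt"; // replace with the instance in question
    for attempt in 1..=10 {
        let body = reqwest::blocking::get(url)?.text()?;
        let denied = body.contains("Disallow: /");
        println!("attempt {:2}: opt-out rule present: {}", attempt, denied);
    }
    Ok(())
}
```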

With the change described above we will remove the instance eventually, but it may get re-added a few more times until the clownflare caches become coherent. If you run the instance in question and wondered how it got added despite the robots.txt saying otherwise: sorry about that!