CanastaWiki / Canasta

MediaWiki Docker image for Canasta, an all-in-one MediaWiki stack for easy deployment and management of enterprise-ready MediaWiki on production environments.
https://www.canasta.wiki
MIT License

The robots.txt does not allow to index /w/images and /w/sitemap #314

Closed pastakhov closed 6 months ago

pastakhov commented 11 months ago

I think it is better if crawlers can index images (Allow: /w/thumb.php? should be included as well). Also, if robots.txt does not allow access to /w/sitemap, the crawler can't access the sitemap files.

Probably /w/index.php? should be allowed as well. If I'm not wrong, all HTML served from /w/index.php? URLs contains the <meta name="robots" content="noindex,nofollow"> tag, but crawlers want to be able to check those pages anyway. This is a question for an SEO specialist. I just saw that Google's crawler complained when it was not allowed to scan the pages.

vedmaka commented 11 months ago

Google likes to complain. I would keep things aligned with the official instructions at https://www.mediawiki.org/wiki/Manual:Robots.txt. Also, allowing crawling of /w/index.php? may put unnecessary load on the wiki, with bots trying to load various diff and history pages, which are usually resource-heavy.

I agree that images and thumbs can be whitelisted

pastakhov commented 7 months ago

allowing crawling of /w/index.php? may put unnecessary load on the wiki, with bots trying to load various diff and history pages, which are usually resource-heavy

I agree
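
For anyone who wants to try the rules discussed here before changing the image, below is a rough sketch in Python using the standard library's urllib.robotparser. The rule set is an assumption based on this thread (allow /w/images, /w/thumb.php and /w/sitemap, keep the rest of /w/ including /w/index.php? disallowed); it is not the robots.txt that Canasta ships. The example.org URLs, the sitemap file name, and the helper names build_parser and is_allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set based on this thread, NOT Canasta's shipped robots.txt.
# Allow lines are listed before Disallow because urllib.robotparser applies
# the first rule whose path is a prefix of the requested URL.
ROBOTS_TXT = """\
User-agent: *
Allow: /w/images/
Allow: /w/thumb.php
Allow: /w/sitemap/
Disallow: /w/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, agent: str = "*") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Example URLs are placeholders; substitute URLs from your own wiki.
    for url in [
        "https://example.org/w/images/a/ab/Foo.png",
        "https://example.org/w/thumb.php?f=Foo.png&width=120",
        "https://example.org/w/sitemap/sitemap-index-wiki.xml",
        "https://example.org/w/index.php?title=Foo&action=history",
    ]:
        print(url, is_allowed(parser, url))
```

Python's parser is order-sensitive, while crawlers that follow RFC 9309 (Google included) pick the most specific matching rule regardless of order, so treat this only as a quick sanity check and not as proof of how a particular crawler will behave.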