Closed lkraav closed 1 year ago
Writing a simple integration override plugin, it dawned on me that it would be useful if the magic number `$blog_public = 2` were defined as a class constant, so that not only internal code but also outside code could reference it consistently.
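A minimal sketch of what this could look like (the constant name `BLOG_PUBLIC_RESTRICTED` is hypothetical; the plugin currently hard-codes the value `2` inline):

```php
class Restricted_Site_Access {
	// Hypothetical named constant for the blog_public value stored
	// when restriction is enabled. External code could then reference
	// Restricted_Site_Access::BLOG_PUBLIC_RESTRICTED instead of the
	// magic number 2.
	const BLOG_PUBLIC_RESTRICTED = 2;
}

// In an integration override plugin:
if ( (int) get_option( 'blog_public' ) === Restricted_Site_Access::BLOG_PUBLIC_RESTRICTED ) {
	// Site access is restricted; adjust integration behavior here.
}
```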
@lkraav could you help provide a use case where this would be useful? If I'm understanding your request correctly, then I'm unable to foresee a scenario where I'd want to restrict visitors by IP while letting search bots index the site (if that's truly what you're asking).
But that's exactly what's currently happening, if you filter the `robots` query variable to get through the restriction.
Of course, with no filter in place, `robots.txt` requests get redirected according to the configured behavior, but I believe that's not optimal either.
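A sketch of the kind of filtering described above, using RSA's restriction filter (check the filter name against the version of the plugin you have installed; this is an illustration, not the plugin's documented API):

```php
// Let robots.txt requests bypass the access restriction by short-circuiting
// RSA's restriction check when the 'robots' query variable is set.
add_filter( 'restricted_site_access_is_restricted', function ( $is_restricted, $wp ) {
	if ( isset( $wp->query_vars['robots'] ) ) {
		return false; // robots.txt query gets through the restriction.
	}
	return $is_restricted;
}, 10, 2 );
```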
@lkraav after internal discussion, our intention is to keep the functionality of Restricted Site Access as-is in relation to this issue. We'll work to update our documentation and readme files in a related PR to more clearly reflect this. Thank you for calling this to our attention, you're helping to ensure we're best representing Restricted Site Access to the community!
My goal using the Restricted Site Access plugin was to (a) restrict site access based on IP address, and (b) stop search engines from crawling the website. I was surprised to find that (a) was working fine, but the website was still being crawled by Google. Each search result then ended in a 404. In my opinion it would be much better if the site were not crawled at all, or if you had a choice (although I cannot think of a situation where you would want your full site under restricted access but still crawled). Do I understand correctly that you've decided to leave things as-is? For what reason?
Seems like that `Disallow` directive will be removed from WordPress itself soon: https://make.wordpress.org/core/2019/09/02/changes-to-prevent-search-engines-indexing-sites/
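If the core default goes away, a plugin could still emit the directive itself through core's `robots_txt` filter, which receives the generated output and the site's public flag. A minimal sketch:

```php
// Append a blanket Disallow rule when the site is not public, rather
// than relying on the core default that the linked post removes.
add_filter( 'robots_txt', function ( $output, $public ) {
	if ( ! $public ) {
		$output .= "Disallow: /\n";
	}
	return $output;
}, 10, 2 );
```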
Perhaps we should think about the meta tag a bit more? Not sure if that's actually better covered by something like an SEO plugin, though. Would be worthy of a separate issue to discuss.
As I understand it, Google will eventually remove redirected URLs from its results. But to achieve that, we need to allow Google to crawl the site so that it learns about the redirects we have in place. So RSA-enabled sites shouldn't be blocked by `robots.txt`, which prevents search engines from detecting the redirections.
Let's examine all RSA options:

- **Send them to the WordPress login screen** and **Show them a simple message**: these two pages have meta `robots` set to `noindex`.
- **Redirect them to a specified web address**: search engine visibility depends on the redirect target.
- **Show them a page**: this page is indexable by search engines (for now). IMO, this is the only place where we need to control search engine visibility with a meta tag.

> but the website was still being crawled by Google

For this, the only case I can think of is: the site was crawled before, and the URLs are cached by Google.
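A sketch of controlling that last case with a meta tag, using `wp_no_robots()` (the core helper available in the WordPress versions current to this thread); the option name storing the landing page ID is hypothetical:

```php
// Force a noindex meta tag onto the "Show them a page" landing page.
// 'rsa_landing_page_id' is a hypothetical option name for illustration.
add_action( 'wp_head', function () {
	$landing_page_id = (int) get_option( 'rsa_landing_page_id' );
	if ( $landing_page_id && is_page( $landing_page_id ) ) {
		wp_no_robots(); // prints <meta name='robots' content='noindex,follow' />
	}
}, 1 );
```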
Goal: ability to selectively combine effects of

Expected `/robots.txt` output:

It's almost like "Restricted site access" or "Discourage search engines" should become an "add-on" type checkbox, instead of a "pick one" radio button.
Your thoughts?
EDIT I wonder if there is a clever way of hooking into https://github.com/WordPress/WordPress/blob/5.0.3/wp-includes/functions.php#L1314
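Assuming the linked core line sits in the `robots.txt` handling (core fires a `do_robotstxt` action and applies a `robots_txt` filter when serving `/robots.txt`), a hook could look like this sketch:

```php
// Fires while core is printing robots.txt, before/around the default rules.
add_action( 'do_robotstxt', function () {
	echo "# Restricted Site Access is active\n";
} );
```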