SparkDevNetwork / Rock

An open source CMS, Relationship Management System (RMS) and Church Management System (ChMS) all rolled into one.
http://www.rockrms.com

Site Crawler incorrectly skips links when robots contains noindex #2989

Closed · cabal95 closed this issue 6 years ago

cabal95 commented 6 years ago

Description

The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified. This is incorrect: links should be followed unless the robots meta tag specifies the nofollow option, or the link itself has a rel="nofollow" attribute.

I can submit a PR for this.

Suggested Action

Add support for the nofollow flag in the robots meta tag. At the same time, update the ParseLinks method to check each link for rel="nofollow" and, if found, skip that individual link.
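
Roughly what I have in mind (this is only a sketch, not the actual Crawler.cs code; the RobotsDirectives, ParseRobotsMeta and ShouldFollowLink names are made up for illustration) is to parse the robots meta tag into two independent flags and then apply a per-link check inside ParseLinks:

```csharp
// Rough sketch only -- not the actual Crawler.cs implementation.
// The type and method names here are made up for illustration.
using System;
using System.Text.RegularExpressions;

public class RobotsDirectives
{
    // Both directives default to allowed when the meta tag is absent.
    public bool Index { get; set; } = true;
    public bool Follow { get; set; } = true;
}

public static class CrawlerSketch
{
    // Parse the content attribute of <meta name="robots" content="..." />
    // into two independent flags instead of one "skip everything" decision.
    public static RobotsDirectives ParseRobotsMeta( string metaContent )
    {
        var result = new RobotsDirectives();

        foreach ( var directive in ( metaContent ?? string.Empty ).ToLowerInvariant().Split( ',' ) )
        {
            switch ( directive.Trim() )
            {
                case "noindex":
                    result.Index = false;
                    break;
                case "nofollow":
                    result.Follow = false;
                    break;
                case "none": // shorthand for "noindex, nofollow"
                    result.Index = false;
                    result.Follow = false;
                    break;
            }
        }

        return result;
    }

    // Per-link check that a ParseLinks-style method could apply:
    // skip an individual anchor only when its rel attribute contains "nofollow".
    public static bool ShouldFollowLink( RobotsDirectives pageDirectives, string relAttribute )
    {
        if ( !pageDirectives.Follow )
        {
            return false;
        }

        return !Regex.IsMatch( relAttribute ?? string.Empty, @"\bnofollow\b", RegexOptions.IgnoreCase );
    }
}
```

The point being that noindex would only control whether the page's content is sent to the index, while nofollow (page-level or per-link) would control whether its links are queued for crawling.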

Expected behavior:

I want to build a page that has links to other pages that should be indexed but would not normally be found during a site crawl (for example, event pages whose links only appear after clicking a PostBack button, which the crawler cannot follow).

Additionally, there are a few pages on the site that we don't want indexed because they are little more than menu/link-only pages.

Actual behavior:

These link-only pages get indexed, because there is currently no way to have the crawler follow a page's links without also indexing the page itself.

jonedmiston commented 6 years ago

Can you provide a bit more info on your specific use case? I'm not sure I understand this line:

The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified.

Where is this specified?

cabal95 commented 6 years ago

Can you provide a bit more info on your specific use case?

We have about 300 calendar events, but only a handful get indexed. The Events page (the one with the mini calendar and short descriptions of the events) only shows the events for the current month unless the user clicks the little arrow to go to the next month. That navigation is done via PostBack, which means the crawler cannot reach the information since it is a JavaScript link. In fact, only about 25 of these events showed up in the index after crawling the site.

So a concrete example would be: come October we add calendar events for our Christmas program in December. Unless those calendar items are linked from somewhere else (like the homepage), the site crawler will not find them until December rolls around and the Events page shows those items on the initial page load.

Where is this specified?

https://github.com/SparkDevNetwork/Rock/blob/develop/Rock/UniversalSearch/Crawler/Crawler.cs#L156

So for example, with a normal site crawler (e.g. Google), if I don't want a specific page to be indexed I would add <meta name="robots" content="noindex" />. That means "do not index the content on this page, but do follow any links". If I wanted the crawler to neither index the page nor follow its links, I would specify <meta name="robots" content="noindex, nofollow" />.

Currently, the Rock Site Crawler will not follow links if noindex is specified, which is incorrect behavior.
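
To illustrate the difference (again just a sketch, reusing the hypothetical ParseRobotsMeta helper from the issue description above, not anything that exists in Crawler.cs today):

```csharp
// "noindex" and "nofollow" are independent directives.
var page = CrawlerSketch.ParseRobotsMeta( "noindex" );
// page.Index  == false -> do not send this page's content to Universal Search
// page.Follow == true  -> still queue this page's links for crawling

var locked = CrawlerSketch.ParseRobotsMeta( "noindex, nofollow" );
// locked.Index  == false -> do not index the page
// locked.Follow == false -> do not follow any of its links either
```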

Edited some text for clarity of meaning

jonedmiston commented 6 years ago

OK, that makes more sense. I was thinking you might be talking about a robots.txt file. OK to PR.