Closed cabal95 closed 6 years ago
Can you provide a bit more info on your specific use case? I'm not sure I understand this line:
The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified.
Where is this specified?
Can you provide a bit more info on your specific use case?
We have about 300 calendar events, but only a handful get indexed. The Events page (the one with the mini calendar and short descriptions of the events) shows only the current month's events unless the user clicks the little arrow to go to the next month. That navigation is done via PostBack, a JavaScript link the crawler cannot follow, so it never reaches the other months. In fact, only about 25 of these events showed up in the index after crawling the site.
So a concrete example would be: come October, we add calendar events for our Christmas program in December. Unless these calendar items are linked from somewhere else (like the homepage), the site crawler will not find them until December rolls around and the Events page shows them on the initial page load.
Where is this specified?
https://github.com/SparkDevNetwork/Rock/blob/develop/Rock/UniversalSearch/Crawler/Crawler.cs#L156
So for example, with a normal site crawler (e.g. Google), if I don't want a specific page to be indexed I would add <meta name="robots" content="noindex" />. That would mean "do not index the content on this page, but do follow any links". If I wanted the search engine/crawler to not index and not follow links, I would specify <meta name="robots" content="noindex, nofollow" /> to indicate "do not index, and do not follow links".
Currently, the Rock Site Crawler will not follow links if noindex is specified, which is incorrect behavior.
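To make the intended semantics concrete, here is a minimal sketch in Python (not Rock's C# code) of how a crawler could read the robots meta tag and decide indexing and link-following separately. The names RobotsMetaParser and robots_policy are made up for illustration, and treating "none" as shorthand for "noindex, nofollow" follows the usual search engine convention rather than anything in the Rock source.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (should_index, should_follow) for a page (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives
    # The two decisions are independent: noindex alone must not block links.
    should_index = not ({"noindex", "none"} & directives)
    should_follow = not ({"nofollow", "none"} & directives)
    return should_index, should_follow


if __name__ == "__main__":
    print(robots_policy('<meta name="robots" content="noindex" />'))
    print(robots_policy('<meta name="robots" content="noindex, nofollow" />'))
```

The point of returning two flags is that a link-only page can be skipped by the indexer while its links are still queued for crawling.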
Edited some text for clarity of meaning
OK, that makes more sense. I was thinking you may be talking about a robots.txt file. OK to PR.
Prerequisites
Description
The Rock Site Crawler that came with Universal Search will not follow links if the robots noindex option is specified. This is incorrect: links should be followed unless the robots meta specifies the nofollow option, or the link itself has a rel="nofollow" option.
I can submit a PR for this.
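As a rough illustration of the per-link rule, the following Python sketch (again, not the actual Crawler.cs code) collects the links on a page and skips any anchor whose rel attribute contains nofollow. LinkCollector and parse_links are hypothetical names; parse_links only mirrors the idea of a link-parsing method and is not Rock's ParseLinks.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href", "").strip()
        # rel is a space-separated token list, e.g. rel="external nofollow".
        rel_tokens = attr.get("rel", "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def parse_links(html):
    """Return the links a crawler may follow (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = (
        '<a href="/events/1">Event 1</a>'
        '<a rel="nofollow" href="/login">Login</a>'
        '<a href="/events/2">Event 2</a>'
    )
    print(parse_links(page))
```

A crawler would presumably call something like this only when the page-level robots meta does not already say nofollow, so the page-level and link-level checks stay separate.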
Suggested Action
Add support for the nofollow flag in the robots meta tag. At the same time, update the ParseLinks method to check for rel="nofollow" in the link and, if found, skip that individual link.
Expected behavior:
I want to build a page that has links to other pages that should be indexed but would not normally be found during a site crawl (for example, event pages whose links only show up after clicking a PostBack button, which the crawler cannot follow).
Additionally, there are a few pages on the site that we don't want indexed because they are little more than menu/link-only pages.
Actual behavior:
These link-only pages are indexed because I cannot have the crawler follow links but not index the page itself.
Versions