algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Should the crawler respect the <meta name="robots" content="noindex,nofollow">? #401

Open pixelastic opened 6 years ago

pixelastic commented 6 years ago

A user expected the crawler to respect the <meta name="robots" content="noindex,nofollow"> meta tag, which tells crawlers to skip a page. We don't honor this tag at all (nor do we honor robots.txt).

I've always considered DocSearch an opt-in crawler, so it isn't bound to respect those rules: everything it crawls (or doesn't) is configured in the config file that each website owner can edit. So I don't think we should respect this.

That being said, maybe we should introduce a new DocSearch meta tag to exclude pages, to give owners more fine-grained control without requiring a PR to their config.
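
For illustration only, such a per-page opt-out could boil down to a single meta-tag check in the scraper. The "docsearch" tag name and the helper below are hypothetical; nothing like this exists today, and the sketch assumes a Scrapy response object.

```python
# Hypothetical sketch: a "docsearch" opt-out meta tag and this helper do not
# exist in the scraper; this only illustrates the idea of a per-page exclusion.
def page_opts_out(response):
    """Return True when the page carries <meta name="docsearch" content="noindex">."""
    content = response.xpath('//meta[@name="docsearch"]/@content').get(default="")
    return "noindex" in content.lower()
```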

Thoughts @Shipow @s-pace @clemfromspace?

s-pace commented 6 years ago

I do think that creating a dedicated tag would be nice for excluding a small subset of pages. However, let's try to avoid making it a regular practice, since it might increase the load of the crawl by downloading more content than required. Providing a dedicated sitemap might be wiser. We would then only follow the links from this dedicated documentation sitemap. WDYT?

pixelastic commented 6 years ago

Providing a dedicated sitemap might be wiser.

That could be a great idea. Some kind of docsearch.xml at the root that would list all the pages to crawl. Or maybe there is a way to re-use the standard robots.txt file?
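
Purely to illustrate the idea (the docsearch.xml name is hypothetical and this is not part of the scraper), seeding the crawl from such a file could look like:

```python
# Illustrative sketch: fetch a hypothetical docsearch.xml sitemap and collect
# its <loc> entries to use as the crawl's start URLs.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def start_urls_from_sitemap(url="https://example.com/docsearch.xml"):
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]
```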

clemfromspace commented 6 years ago

Scrapy has built-in support for robots.txt: https://doc.scrapy.org/en/latest/topics/settings.html?highlight=robot#std:setting-ROBOTSTXT_OBEY https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots

It should be easy to add the Scrapy setting here ('ROBOTSTXT_OBEY': True): https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/index.py#L52
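
For reference, a minimal sketch of that change, assuming the settings are a plain dict handed to Scrapy's CrawlerProcess (the exact wiring in index.py may differ):

```python
# Minimal sketch, not the scraper's actual code: enabling Scrapy's built-in
# robots.txt handling is a single settings key.
from scrapy.crawler import CrawlerProcess

settings = {
    "USER_AGENT": "Algolia DocSearch Crawler",  # placeholder value
    "ROBOTSTXT_OBEY": True,  # RobotsTxtMiddleware then skips disallowed URLs
}

process = CrawlerProcess(settings)
# process.crawl(DocumentationSpider, ...)  # spider wiring omitted
```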

It might impact existing configurations, though.

pixelastic commented 6 years ago

Yep, I think we should not follow robots.txt by default (because changing that would not be backward compatible).

My suggestion was that maybe we could reuse the robots.txt syntax to add custom DocSearch information. Maybe something like:

User-agent: DocSearch
Disallow: /dont-index-that-directory/
Disallow: /tmp/
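
As a quick check of the idea, Python's stdlib parser already understands a per-agent section like that; the "DocSearch" agent name and the paths are of course hypothetical:

```python
# Sketch: evaluate a hypothetical "DocSearch" robots.txt section with the
# standard-library parser.
from urllib import robotparser

rules = """\
User-agent: DocSearch
Disallow: /dont-index-that-directory/
Disallow: /tmp/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("DocSearch", "https://example.com/docs/intro/"))                   # True
print(parser.can_fetch("DocSearch", "https://example.com/dont-index-that-directory/x/"))  # False
```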
s-pace commented 6 years ago

Good idea, but let's put this into the configuration.

Let's wait for the codebase refactor (migration to Python 3)?

clemfromspace commented 6 years ago

Yeah, let's wait for the refactor; we can then add a new middleware inspired by the built-in one from Scrapy: https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py#L88
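
Roughly, and only as a sketch (the eventual middleware would resolve robots.txt asynchronously like Scrapy's built-in one does, and the "DocSearch" agent name is hypothetical), it could look like this:

```python
# Simplified sketch of a custom downloader middleware that drops requests
# disallowed for a hypothetical "DocSearch" user agent. It uses the blocking
# stdlib parser for brevity; Scrapy's built-in middleware is asynchronous.
from urllib import robotparser
from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest


class DocSearchRobotsTxtMiddleware:
    def __init__(self):
        self._parsers = {}  # one cached robots.txt parser per host

    def _parser_for(self, url):
        parts = urlparse(url)
        if parts.netloc not in self._parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            rp.read()  # blocking fetch, acceptable for a sketch
            self._parsers[parts.netloc] = rp
        return self._parsers[parts.netloc]

    def process_request(self, request, spider):
        if not self._parser_for(request.url).can_fetch("DocSearch", request.url):
            raise IgnoreRequest(f"robots.txt disallows {request.url} for DocSearch")
        return None  # let the download proceed
```

It would then be enabled through the DOWNLOADER_MIDDLEWARES setting, like any other downloader middleware.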

nkuehn commented 3 years ago

from 2018:

Yeah, let's wait for the refactor; we can then add a new middleware

Any updates on this issue? It would be great to at least know Algolia's decision on whether support for respecting the noindex/nofollow meta tags is intended at all.

Shipow commented 3 years ago

Hi @nkuehn. As far as I know, nothing is in the pipe regarding this for the moment. Could you give more detail on how this would impact your experience or technical requirements?

nkuehn commented 3 years ago

Sure: our docs site generator supports a Markdown frontmatter flag that emits the standard noindex meta tag to ensure a given page is not indexed by search engines.

There are varying use cases: pre-release documentation, deprecated features that are documented only as an archive, pages that are just lists or navigation aids and should not appear in search results, etc.

These pages are often in that state only temporarily and do not follow a specific, regex-able pattern that we could put into the DocSearch config. We also need immediate control over adding and removing them, without constantly bothering you (the Algolia DocSearch team) with a PR to your configs repo for every individual change.

We now understand that DocSearch only relies on whether a page is reachable through crawling. So we are teaching docs authors the different behavior of on-site search vs. public search engines, and living with some pages appearing in search that we would ideally prefer not to see there. It's an acceptable situation: anything we absolutely want to hide would not be linked anyway.

TL;DR: the main downside is the additional mental workload for authors to understand the subtle difference between excluding a page from "search" (on-site) vs. "search" (public). IMHO that's absolutely acceptable for a free product that is great in all other respects.

PS: I personally think a crawler should respect de facto standard HTML meta tags by default, not only via customization. But that's probably feedback for Scrapy rather than DocSearch.

Shipow commented 3 years ago

Legit. cc @shortcuts, we should have a look at the current state of this. Thanks @nkuehn for taking the time to give more details.