Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Abort sync when condition met #858

Closed: alexthamm closed this issue 8 months ago

alexthamm commented 9 months ago

Hey there, is it possible to cancel a crawl if a certain condition is met?

Background: We are indexing a source system that requires forms-based authentication. This usually works well. However, if the session is aborted for any reason, the system returns the login page with an HTTP 200 status code. The web crawler cannot recognize that it needs to re-authenticate, which has resulted in the index being “cleaned” several times already.

As far as I know, we cannot configure the web crawler to re-authenticate when it detects the login page, whether by its URL or by a custom header, as long as that page is returned with a 200 status code.

The owner of the source system is hesitant to change the response code. However, it would be acceptable for us to abort the crawl when the login page is detected, for example via its URL (login.aspx?next=...) or via a specific field in the HTML header.
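
To make the condition concrete, here is a rough plain-Java sketch of the check we have in mind; no Norconex API is involved, and the login-page marker is passed in as an opaque value because the exact header field is still to be decided:

```java
// Plain-Java sketch of the detection condition only; no Norconex API involved.
import java.util.regex.Pattern;

public final class LoginPageCheck {

    // URL shape of the login page described above (login.aspx?next=...).
    private static final Pattern LOGIN_URL =
            Pattern.compile(".*login\\.aspx\\?next=.*");

    /**
     * @param url         the URL the page was fetched from
     * @param loginMarker value of the specific header field that identifies
     *                    the login page, or null if the field is absent
     * @return true when the fetched page looks like the login page
     */
    public static boolean isLoginPage(String url, String loginMarker) {
        return LOGIN_URL.matcher(url).matches() || loginMarker != null;
    }

    private LoginPageCheck() {
    }
}
```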

Can you suggest any way to interrupt the crawl on purpose to prevent deletions from being sent to the search engine?

Side note: we're using Norconex 2.9.1, and if at all possible we would like to avoid writing custom code.

Thanks, Alex

sakanaosama commented 9 months ago

Hi Alex,

In version 2.9 (and 3.x), there is no built-in feature to initiate a shutdown (or stop crawling) based on page status or metadata/header values. However, you can configure the crawler to trigger an external custom script based on the document handler/metadata/reference, and that script can then terminate the process.
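
For illustration only, below is a minimal sketch of such a hook written as a 2.x pre-import processor that terminates the JVM when the login page is detected. The `IHttpDocumentProcessor` signature and the `<preImportProcessors>` wiring are my assumptions about the 2.9.1 API; please verify them against the Javadoc before trying anything like this.

```java
// Sketch only, assuming the 2.x pre-import processor hook
// (com.norconex.collector.http.processor.IHttpDocumentProcessor) and that the
// crawler config registers this class under <preImportProcessors>.
// Verify class and method names against the 2.9.1 Javadoc.
package com.example.crawler;

import java.util.regex.Pattern;

import org.apache.http.client.HttpClient;

import com.norconex.collector.http.doc.HttpDocument;
import com.norconex.collector.http.processor.IHttpDocumentProcessor;

public class AbortOnLoginPageProcessor implements IHttpDocumentProcessor {

    // Same URL shape as in the issue description (login.aspx?next=...).
    private static final Pattern LOGIN_URL =
            Pattern.compile(".*login\\.aspx\\?next=.*");

    @Override
    public void processDocument(HttpClient httpClient, HttpDocument doc) {
        if (LOGIN_URL.matcher(doc.getReference()).matches()) {
            System.err.println("Login page detected at " + doc.getReference()
                    + "; session likely expired. Terminating crawl.");
            // Hard stop: the process dies before the end-of-crawl orphan
            // processing runs, so no orphan deletions reach the committer.
            System.exit(1);
        }
    }
}
```

A gentler variant would be to have such a hook trigger an external script that stops the collector (for example, through the launch script's stop action) instead of killing the JVM outright.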

Nevertheless, if the website correctly returns the status code and redirects, 3.x should be able to re-authenticate, thanks to its improved authentication handling via "GenericHttpFetcher/authentication". It would be worth running a test to confirm this works in your setup.

To better understand the session expiration and the deletions in your case, here are some questions:

  1. If the authentication for a valid page expires and the "login page" is returned, does the crawler receive a 401 (or 404) from the URL of the valid page?
  2. Is it then redirected to (login.aspx?next=...)?
  3. Alternatively, is a 401/404 never returned for the valid page at any point?
  4. Otherwise, when authentication has expired, is the URL of the valid page indexed with the login page "content" while keeping a 200 status?
  5. Does the committer send deletions or updates to the search engine?
  6. What headers does the response carry after expiration, and could they be used to identify that the authentication has expired?

Additionally, consider the search engine in use: it might be possible to prevent deletions/updates at the engine level by leveraging metadata fields extracted from the header.

If you can reproduce the problem, debug logging would help clarify what is happening. Please include the logs here.

Thanks -Ryan Ng

alexthamm commented 8 months ago

Hi Ryan, thank you for your answer! I almost missed it :see_no_evil:

We didn't see a way to detect the login page in the connector pipeline and stop the crawl without writing custom code. Therefore, the authentication approach has been reevaluated and, together with the customer, we have agreed on another solution to this problem: the source system has been adapted so that the connector can authenticate itself in a different way. There is no need to detect the login page anymore, and the setup is more robust against expiring sessions.

I will close this issue. There is no error in the web crawler, and as long as the source system follows web standards, the authentication works properly.

Thank you, Alexander