Closed by alexthamm 8 months ago
Hi Alex,
In response to your inquiry: as of version 2.9 (and 3.x), there is no built-in feature to initiate a shutdown (or stop crawling) based on a page's status or metadata/header values. However, you can configure the crawler to invoke an external custom script depending on the document handler/metadata/reference. In that scenario, the custom script can then terminate the process.
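As a rough illustration of that custom-script idea, the sketch below decides whether a fetched document is really the login page and exits non-zero so a wrapper can stop the crawl. The header name `X-App-Login-Page` and the URL pattern are assumptions for this example, not anything Norconex provides:

```python
import sys

# Assumed custom header the source system could set on its login page.
LOGIN_MARKER_HEADER = "X-App-Login-Page"

def should_abort_crawl(url: str, headers: dict) -> bool:
    """Return True when the fetched document looks like the login page."""
    if "login.aspx" in url.lower():
        return True
    return headers.get(LOGIN_MARKER_HEADER, "").lower() == "true"

if __name__ == "__main__":
    # In a real setup the URL and headers would come from the crawler's
    # external-script invocation; argv is used here only for brevity.
    url = sys.argv[1] if len(sys.argv) > 1 else ""
    marker = sys.argv[2] if len(sys.argv) > 2 else ""
    if should_abort_crawl(url, {LOGIN_MARKER_HEADER: marker}):
        sys.exit(1)  # non-zero exit lets a wrapper terminate the crawl
```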
Nevertheless, if the website correctly returns status codes and redirects, 3.x should be able to re-authenticate. It offers improved authentication handling through "GenericHttpFetcher/authentication." It is advisable to run a test to confirm this works in your setup.
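The re-authentication behaviour described above can be pictured with a small stand-in fetcher. The `Response` shape, `fetch`/`login` callables, and `LOGIN_URL` are illustrative stand-ins, not the Norconex API; the point is only that a proper redirect to the login page gives the crawler something it can react to:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Response:
    status: int
    location: str = ""  # redirect target, if any
    body: str = ""

LOGIN_URL = "https://example.com/login.aspx"  # assumed login endpoint

def fetch_with_reauth(url: str,
                      fetch: Callable[[str], Response],
                      login: Callable[[], None]) -> Response:
    """Fetch a URL; if the server redirects to the login page, refresh
    the session once and retry the original URL."""
    resp = fetch(url)
    if resp.status in (301, 302) and resp.location.startswith(LOGIN_URL):
        login()            # re-authenticate
        resp = fetch(url)  # retry the original request
    return resp
```

If the server instead returns the login page with a 200 status, as in the reported scenario, nothing in the response signals that re-authentication is needed, which is exactly the problem.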
To better understand the session expiration and deletion behaviour in your case, I have a few questions:
Additionally, consider the search engine in use. It may be possible to prevent deletions/updates at the engine level by leveraging metadata fields extracted from the header.
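At the indexing end, that idea amounts to filtering the commit batch before it reaches the engine. A minimal sketch, assuming documents flagged as login pages have been collected into a set beforehand (the tuple shape and the flag set are inventions for this example):

```python
def filter_deletions(operations, login_page_ids):
    """Drop delete operations for documents whose last fetch looked like
    the login page, so an expired session cannot wipe the index.

    operations:     list of (action, doc_id) tuples, action in {"upsert", "delete"}
    login_page_ids: set of doc_ids whose metadata carried an assumed
                    login-page marker
    """
    return [(action, doc_id) for action, doc_id in operations
            if not (action == "delete" and doc_id in login_page_ids)]
```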
If you can reproduce the issue, debug logs would help us understand it better. Please attach them here.
Thanks -Ryan Ng
Hi Ryan, thank you for your answer! I almost missed it :see_no_evil:
We didn't see a way to detect the login page in the connector pipeline and stop the crawl without writing custom code. We therefore reevaluated the authentication approach and, together with the customer, agreed on another solution: the source system has been adapted so that the connector can authenticate in a different way. There is no longer any need to detect the login page, and the setup is more robust against expiring sessions.
I will close this issue. There is no error in the web crawler: as long as the source system follows web standards, authentication works properly.
Thank you, Alexander
Hey there, is it possible to cancel a crawl if a certain condition is met?
Background: We are indexing a source system that requires forms-based authentication. This usually works well. However, if the session is aborted for any reason, the system returns the login page with an HTTP 200 status code. The web crawler cannot recognize that it needs to re-authenticate. This has resulted in the index being "cleaned" several times already.
As far as I know, we cannot configure the web crawler to re-authenticate when it detects the login page, whether by its URL or by a custom header, if the page is returned with a 200 status code.
The owner of the source system is hesitant to change the response code. However, it would be acceptable to abort the crawl when the login page is detected, for example via the URL (login.aspx?next=...) or via a specific field in the HTML header.
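For completeness, the detection itself is straightforward outside the crawler. The sketch below checks both signals mentioned above: the URL pattern and a marker in the HTML head. The `<meta>` name and value are assumptions standing in for whatever field the source system actually emits:

```python
from html.parser import HTMLParser

LOGIN_META_NAME = "app-page-type"  # assumed <meta> name on the login page
LOGIN_META_VALUE = "login"         # assumed value marking the login page

class _MetaScanner(HTMLParser):
    """Scan <meta> tags in the document head for the login-page marker."""
    def __init__(self):
        super().__init__()
        self.is_login = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") == LOGIN_META_NAME and d.get("content") == LOGIN_META_VALUE:
                self.is_login = True

def looks_like_login_page(url: str, html: str) -> bool:
    """Return True if either the URL or the HTML head marks a login page."""
    if "login.aspx?next=" in url:
        return True
    scanner = _MetaScanner()
    scanner.feed(html)
    return scanner.is_login
```

The open question, as discussed in this thread, is how to hook such a check into the crawl without custom code.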
Can you suggest any way to interrupt the crawl on purpose to prevent deletions from being sent to the search engine?
Side note: We're using Norconex 2.9.1, and we would like to avoid writing custom code if at all possible.
Thanks, Alex