davidfordaus closed 3 years ago
From what I understand, the PROCESS will not delete or ignore, but it will "Reprocessing any cached/orphan references..." according to the log info.
@somphouang is correct. For all of them:
An orphan does not mean the page is no longer there, but rather that it is no longer referenced. When it is no longer referenced, it "may" be because it was removed from the site (404), but not necessarily. That is why we give the option. The orphan strategy is only applicable to subsequent crawls. When the crawler is done crawling "normally" (i.e., following links as they are discovered), any URLs that were crawled in the previous crawl but not in this latest one are considered orphans.
IGNORE - Like you said: do nothing. This means no requests will be sent to your Committer and the orphan URL won't be recrawled.
DELETE - If the page is no longer "referenced" by any pages encountered during the crawl (and it was in the previous crawl), then generate a DELETE to the committer (regardless of whether the page still exists and is valid or not).
PROCESS - The orphan URL (left over from the previous crawl) will be recrawled as if it had been discovered during regular crawling. It is then treated just like any other URL, i.e., it will go through the same flow, leading to either a rejection or a call to your committer (for an update or a deletion).
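For reference, the three options above are selected per crawler in the XML configuration. The fragment below is a minimal sketch, assuming the 2.x configuration format, where the option is (to my knowledge) spelled `orphansStrategy`. The crawler id is hypothetical and all other crawler settings are omitted.

```xml
<!-- Sketch only: "my-crawler" is a hypothetical id; other crawler settings are omitted. -->
<crawler id="my-crawler">
  <!-- One of PROCESS (the default mentioned in this thread), IGNORE, or DELETE. -->
  <orphansStrategy>PROCESS</orphansStrategy>
</crawler>
```

Since the strategy only applies to URLs left over from a previous crawl, changing it should have no visible effect on a first crawl.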
Clearer?
Hi @essiembre,
Thanks for the detailed explanation. I have an additional question about the expected behavior here: when, in PROCESS mode, an orphan URL gets reprocessed and ends up as REJECTED_REDIRECTED, that URL would get deleted from the index (via the committer), correct? That is what I would expect at least, or are there other settings that need to be in place for this to happen?
Thanks!
Thanks for the response - I'll add my own clarifications below. Can you confirm that they're correct, please?
IGNORE - (obvious)
DELETE - The page may still exist and be readable, however if it isn't referenced by other pages then it will be deleted.
PROCESS - Re-try any pages that were found in a previous crawl and delete or update as appropriate. The page doesn't need to be referenced by any current page in the website.
So as a short TLDR:
Hi Patrick / Norconex - are you able to provide detail about the above request, please?
Importantly, regarding the statement related to "IF FOUND and REDIRECTED": it's a little unclear what happens when the crawler detects a redirect of an item that's no longer referenced.
@davidfordaus, your understanding is correct. @RonaldKepken, the redirect scenario you described is not the default behavior. If a URL was previously valid (200) and is now redirected, it will be rejected in favor of its target, but it will not be deleted (if the target is 404, that target will be deleted).
If you know your new target URL will be reached some other way, you can then treat the redirect URLs as equivalent to "Not Found" and then they should be deleted if they were ok on the previous round.
Assuming you are using version 2.x, here is how you can do it using the GenericDocumentFetcher:
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
  <notFoundStatusCodes>404,301,302</notFoundStatusCodes>
</documentFetcher>
Thanks Patrick - that clarifies the situation regarding redirects, and I appreciate the confirmation of the non-redirect situation. Closed from my side now.
The documentation on the orphanStrategy is very ambiguous at the moment. The statuses appear to be:
IGNORE - do nothing
DELETE - if a path is no longer found then generate a DELETE to the committer
PROCESS - I'm very unsure under what circumstances this ever generates a delete?
Are you able to provide details of the circumstances for deletions with the default PROCESS setting please?