Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

OrphanStrategy - DELETE, PROCESS, IGNORE - under what circumstances does PROCESS remove entries? #722

Closed davidfordaus closed 3 years ago

davidfordaus commented 3 years ago

The documentation on the orphanStrategy is very ambiguous at the moment. The statuses appear to be:

IGNORE - do nothing

DELETE - if a path is no longer found then generate a DELETE to the committer

PROCESS - I'm very unsure under what circumstances this ever generates a delete?

Are you able to provide details of the circumstances for deletions with the default PROCESS setting please?

somphouang commented 3 years ago

From what I understand, PROCESS will neither delete nor ignore orphans. It reprocesses them, as the log message "Reprocessing any cached/orphan references..." indicates.

essiembre commented 3 years ago

@somphouang is correct. For all of them:

An orphan does not mean the page is no longer there, only that it is no longer referenced. When it is no longer referenced, it "may" be because it was removed from the site (404), but not necessarily. That is why we give the option. The orphan strategy is only applicable to subsequent crawls. When the crawler is done crawling "normally" (i.e., following links as they are discovered), any URLs that were crawled in the previous crawl but not in this one are considered orphans.

IGNORE - Like you said: do nothing. This means no requests will be sent to your Committer and it won't be recrawled.

DELETE - If the page is no longer "referenced" by any pages encountered during the crawl (and it was in the previous crawl), then generate a DELETE to the committer (regardless of whether the page still exists and is valid).

PROCESS - The orphan URL (left over from the previous crawl) will be recrawled as if it were discovered during regular crawling. It is then treated just like any other URL, i.e., it goes through the same flow, leading to either a rejection or a call to your committer (for an update or a deletion).
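
To make the three strategies concrete, here is a minimal Python sketch of the behavior described above. It is not Norconex code: the names `find_orphans`, `handle_orphans`, and `fetch_status` are hypothetical, and the handling of a recrawled URL is simplified to "200 means update, 404 means delete, anything else is rejected".

```python
from enum import Enum


class OrphanStrategy(Enum):
    IGNORE = "IGNORE"
    DELETE = "DELETE"
    PROCESS = "PROCESS"


def find_orphans(previous_crawl, current_crawl):
    """URLs crawled in the previous run but not reached in this one."""
    return sorted(set(previous_crawl) - set(current_crawl))


def handle_orphans(orphans, strategy, fetch_status):
    """Return the (action, url) pairs a committer would receive.

    fetch_status is a hypothetical callable returning an HTTP status code.
    It is only called for PROCESS, the only strategy that recrawls.
    """
    actions = []
    for url in orphans:
        if strategy is OrphanStrategy.IGNORE:
            continue  # no recrawl, nothing sent to the committer
        if strategy is OrphanStrategy.DELETE:
            # Deleted without checking whether the page still exists.
            actions.append(("delete", url))
        elif strategy is OrphanStrategy.PROCESS:
            status = fetch_status(url)
            if status == 200:
                actions.append(("upsert", url))
            elif status == 404:
                actions.append(("delete", url))
            # Any other status (simplified): rejected, nothing is sent.
    return actions
```

With a previous crawl of `["a", "b", "c"]` and a current crawl of `["a"]`, the orphans are `b` and `c`. If `b` still returns 200 and `c` returns 404, then DELETE removes both, while PROCESS updates `b` and deletes only `c`.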

Clearer?

RonaldKepken commented 3 years ago

Hi @essiembre,

Thanks for the detailed explanation. I would have an additional question on the expected behavior here: When in the PROCESS mode an orphan URL gets reprocessed and ends up as REJECTED_REDIRECTED, that URL would get deleted from the index (via the committer), correct? That would be what I would expect at least, or are there other settings that need to exist for this to happen?

Thanks!

davidfordaus commented 3 years ago

Thanks for the response - I'll add my own clarifications as below. Can you confirm that they're correct please?

IGNORE - (obvious)

DELETE - The page may still exist and be readable, however if it isn't referenced by other pages then it will be deleted.

PROCESS - Re-try any pages that were found in a previous crawl and delete or update as appropriate. The page doesn't need to be referenced by any current page in the website.

So as a short TLDR:

davidfordaus commented 3 years ago

Hi Patrick / Norconex - are you able to provide detail about the above request please?

Importantly the statement related to "IF FOUND and REDIRECTED" - it's a little unclear what happens when the crawler detects a redirect of an item that's no longer referenced.

essiembre commented 3 years ago

@davidfordaus, your understanding is correct. @RonaldKepken, the redirect scenario you described is not the default behavior. If a URL was previously valid (200) and is now redirected, it will be rejected in favor of its target, but would not be deleted (if the target is 404, that target will be deleted).

If you know your new target URL will be reached some other way, you can then treat the redirect URLs as equivalent to "Not Found" and then they should be deleted if they were ok on the previous round.

Assuming you are using version 2.x, here is how you can do it using the GenericDocumentFetcher:

  <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
    <notFoundStatusCodes>404,301,302</notFoundStatusCodes>
  </documentFetcher>
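
The effect of that setting can be pictured with a small Python sketch. This is an illustration only, not the fetcher's actual logic: the function `classify` and its return values are hypothetical, and the default of treating only 404 as "not found" is an assumption.

```python
# Assumption: by default only 404 counts as "not found".
DEFAULT_NOT_FOUND = frozenset({404})


def classify(status, not_found_codes=DEFAULT_NOT_FOUND):
    """Map an HTTP status code to what happens to a previously valid URL."""
    if status in not_found_codes:
        return "delete"  # treated as gone: a deletion is sent to the committer
    if status == 200:
        return "upsert"  # still valid: sent to the committer as an update
    return "reject"      # e.g. a redirect: dropped, but not deleted


# Mirrors <notFoundStatusCodes>404,301,302</notFoundStatusCodes>
with_redirects = frozenset({404, 301, 302})
```

Under these assumptions, `classify(301)` yields `"reject"`, while `classify(301, with_redirects)` yields `"delete"`, which is the change in behavior the configuration is meant to produce.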

davidfordaus commented 3 years ago

Thanks Patrick - that clarifies the situation regarding redirects, and I appreciate the confirmation of the non-redirect situation. Closed from my side now.