Closed OkkeKlein closed 7 years ago
Are you using version 2.6.x? Since 2.6.0, orphans are processed (i.e., re-crawled) by default, even if a "parent" page generates an exception. The exception to this is when the parent page times out more than once in a row: it is then "deleted" from the cache and its children are considered orphans. Do you have <orphansStrategy> set to DELETE by any chance?
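If orphan deletion is the issue, you can switch the strategy back to its default. A minimal fragment (element name as used in 2.6.x crawler configs; check your version's documentation to confirm):

```xml
<!-- Part of the crawler configuration.
     PROCESS (the default since 2.6.0) re-crawls orphans;
     IGNORE leaves them untouched; DELETE removes them. -->
<orphansStrategy>PROCESS</orphansStrategy>
```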
Child pages are only considered orphans when the parent page gets deleted. To prevent parent-page deletion on timeouts, you can set your "spoiled" reference strategy to IGNORE for pages with a bad status (the default is to "grace" them once). Like this (part of the crawler config):
<spoiledReferenceStrategizer>
  <mapping state="BAD_STATUS" strategy="IGNORE" />
</spoiledReferenceStrategizer>
Have a look at GenericSpoiledReferenceStrategizer.
If the above does not help you and you can reproduce easily with your config, can you please share it?
Yes, I am deleting orphans, and BAD_STATUS was set to grace once with 2.6.
It seems to me the grace once didn't work, as the previous crawl had finished successfully. Or am I misunderstanding, and it means a retry within the same crawl session didn't complete successfully?
The reference was not imported, as I only use it for links, if that matters.
Should the logs show something when the first attempt failed (grace once)?
It will not attempt to re-crawl within the same crawl session. You will get an entry in the logs if you set this to DEBUG in log4j.properties:
log4j.logger.com.norconex.collector.core.crawler.AbstractCrawler=DEBUG
The message will look something like this:
[CrawlerName]: this spoiled reference is being graced once [...]
If you frequently get timeouts, setting BAD_STATUS to IGNORE may be your best bet. Pages that no longer exist (404) will still trigger a deletion request on the committer if you have NOT_FOUND set to DELETE (which is the default).
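For reference, a fuller strategizer sketch combining both mappings (the class attribute and default behaviors are from GenericSpoiledReferenceStrategizer as I recall them; verify against your version's documentation):

```xml
<spoiledReferenceStrategizer
    class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer">
  <!-- Do not delete pages that merely returned a bad status (e.g., timeouts). -->
  <mapping state="BAD_STATUS" strategy="IGNORE" />
  <!-- Pages that are truly gone still get removed via the committer. -->
  <mapping state="NOT_FOUND" strategy="DELETE" />
</spoiledReferenceStrategizer>
```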
Yeah, might as well set it to IGNORE. Thanx.
Could this exception be handled so it does not lead to all orphans being removed?