Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystems and store it in data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

References removed because of (java.net.SocketException: Connection reset) #316

Closed. OkkeKlein closed this issue 7 years ago.

OkkeKlein commented 7 years ago

Is there a way to keep this exception from causing all orphans to be removed?

essiembre commented 7 years ago

Are you using version 2.6.x? Since 2.6.0, orphans are always processed (i.e., re-crawled) by default, even if a "parent" page generates an exception. The exception to this is when the parent page times out more than once in a row; in that case it is "deleted" from the cache and its children are considered orphans. Do you have <orphansStrategy> set to DELETE by any chance?
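
For reference, that option sits directly under the crawler element. A minimal sketch, assuming a standard v2.x crawler config (the crawler id is just a placeholder; valid values are PROCESS, IGNORE, and DELETE):

  <crawler id="myCrawler">
    ...
    <!-- PROCESS is the default since 2.6.0 -->
    <orphansStrategy>PROCESS</orphansStrategy>
    ...
  </crawler>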

Child pages are only considered orphans when the parent page gets deleted. To prevent parent page deletion on timeouts, you can set your "spoiled" reference strategy to IGNORE for pages with a bad status (the default is to "grace" them once). Like this (part of the crawler config):

  <spoiledReferenceStrategizer>
    <mapping state="BAD_STATUS" strategy="IGNORE" />
  </spoiledReferenceStrategizer>

Have a look at GenericSpoiledReferenceStrategizer
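
For context, spelling out the defaults that matter here would look roughly like this (both mappings below simply restate what the crawler already does out of the box):

  <spoiledReferenceStrategizer>
    <!-- Defaults, written out: bad statuses are graced once, 404s trigger deletion -->
    <mapping state="BAD_STATUS" strategy="GRACE_ONCE" />
    <mapping state="NOT_FOUND" strategy="DELETE" />
  </spoiledReferenceStrategizer>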

If the above does not help and you can reproduce the problem easily, can you please share your config?

OkkeKlein commented 7 years ago

Yes, I am deleting orphans, and BAD_STATUS was set to grace once with 2.6.

It seems to me the grace once didn't work, since the previous crawl had finished successfully. Or am I misunderstanding, and it means a retry within the same crawl session didn't complete successfully?

The reference was not imported, as I only use it for links, if that matters.

Should the logs show something when the first attempt failed (grace once)?

essiembre commented 7 years ago

It will not attempt to re-crawl within the same crawl session. You will get an entry in the logs if you set this logger to DEBUG in log4j.properties:

log4j.logger.com.norconex.collector.core.crawler.AbstractCrawler=DEBUG

The message will look something like this:

[CrawlerName]: this spoiled reference is being graced once [...]

If you frequently get timeouts, setting BAD_STATUS to IGNORE may be your best bet. Pages that no longer exist (404) will still trigger a deletion request on the committer if you have NOT_FOUND set to DELETE (which is the default).
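
In other words, a configuration along these lines should keep timeouts from deleting anything while 404s still get cleaned up (the NOT_FOUND mapping is redundant since DELETE is already its default, but it makes the intent explicit):

  <spoiledReferenceStrategizer>
    <mapping state="BAD_STATUS" strategy="IGNORE" />
    <mapping state="NOT_FOUND" strategy="DELETE" />
  </spoiledReferenceStrategizer>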

OkkeKlein commented 7 years ago

Yeah, might as well set it to IGNORE. Thanx.