Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Subsequent crawls and changing config? #427

Closed (ronjakoi closed this issue 4 years ago)

ronjakoi commented 6 years ago

Let's say I crawl my site once, then notice that I crawled a section that I shouldn't have. So I add some exclusion lines to my reference filters.

When I crawl my site again, will the HTTP Collector notice that some references which were crawled before are now excluded and then send a delete command to my Solr? Or will I have to delete the accidentally crawled references myself?

essiembre commented 6 years ago

When links can no longer be reached on subsequent runs (because of config changes or otherwise), they become "orphans". By default, orphan URLs are reprocessed, but you can change this with the following crawler configuration option:

 <orphansStrategy>...</orphansStrategy>

Possible values are:

- PROCESS (default): orphans are re-crawled and re-committed like any other reference.
- IGNORE: orphans are left alone (neither reprocessed nor deleted).
- DELETE: a deletion request is sent to your committer (Solr, in your case) for each orphan.
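For your case, deleting accidentally crawled references on the next run could look like this (a minimal sketch; the crawler id is a placeholder):

```xml
<crawler id="my-crawler">
  <!-- Send a deletion request to the committer (your Solr)
       for any reference no longer reachable on this run. -->
  <orphansStrategy>DELETE</orphansStrategy>
</crawler>
```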

ronjakoi commented 6 years ago

If I use the DELETE strategy and do subsequent crawls with varying maxDepth values (like a daily shallow crawl and a weekly deeper crawl), will the REJECTED_TOO_DEEP events of the shallower crawls trigger the orphansStrategy and send delete requests?

essiembre commented 6 years ago

URLs that are no longer reachable are orphans, so yes, they should get deleted. To have shallow vs deep crawls, you can create two collector configs with different settings, or you can look at the GenericRecrawlableResolver. That class allows you to define the minimum elapsed time before some of your documents get recrawled.
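A minimal sketch of what that could look like (assuming an HTTP Collector 2.x-style configuration; the URL patterns and frequencies are illustrative placeholders):

```xml
<!-- Inside the <crawler> section. -->
<recrawlableResolver class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver">
  <!-- Pages under /archive/ are only eligible for recrawl once a week... -->
  <minFrequency applyTo="reference" value="weekly">.*/archive/.*</minFrequency>
  <!-- ...while everything else can be recrawled on every (daily) run. -->
  <minFrequency applyTo="reference" value="daily">.*</minFrequency>
</recrawlableResolver>
```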

ronjakoi commented 6 years ago

I do have two separate collector configs, but they both use the same workDir. I assume that's going to be a problem?

Looking into GenericRecrawlableResolver...

essiembre commented 6 years ago

You are correct. While many of the files written have unique names, it is highly recommended not to share the same workDir, to avoid possible collisions.
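For example, giving each collector its own working directory could look like this (a sketch with placeholder ids, paths, and depths, assuming v2.x-style configuration):

```xml
<!-- daily-config.xml: shallow crawl with its own working directory -->
<crawler id="daily-shallow">
  <workDir>./work/daily</workDir>
  <maxDepth>2</maxDepth>
</crawler>

<!-- weekly-config.xml: deeper crawl with a separate working directory -->
<crawler id="weekly-deep">
  <workDir>./work/weekly</workDir>
  <maxDepth>10</maxDepth>
</crawler>
```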