ronjakoi closed this issue 4 years ago.
When links can no longer be reached on subsequent runs (because of configuration changes or otherwise), they become "orphans". By default, orphan URLs are reprocessed, but you can change this behavior with the following crawler configuration option:
<orphansStrategy>...</orphansStrategy>
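For context, this option sits directly under a crawler element. A minimal sketch of where it goes, assuming the 2.x XML layout (the ids and the rest of the configuration are placeholders):

```xml
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="my-crawler">
      <!-- What to do with URLs that were committed on a previous run
           but are no longer reachable on the current one -->
      <orphansStrategy>PROCESS</orphansStrategy>
      <!-- start URLs, committer, filters, etc. -->
    </crawler>
  </crawlers>
</httpcollector>
```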
Possible values are:

- PROCESS: This is the default. Will try to crawl them again in case they were updated or deleted.
- DELETE: Sends deletion requests to your Committer to have them deleted in Solr.
- IGNORE: Does nothing with them (i.e., does not attempt to crawl them and does not send deletion requests).

If I use the DELETE strategy and do subsequent crawls with varying maxDepth values (like a daily shallow crawl and a weekly deeper crawl), will the REJECTED_TOO_DEEP events of the shallower crawls trigger the orphansStrategy and send delete requests?
URLs that are no longer reachable are orphans, so yes, they should get deleted. To have shallow vs. deep crawls, you can create two collector configs with different settings, or you can look at the GenericRecrawlableResolver. That class allows you to define the minimum elapsed time before some of your documents get recrawled, as sketched below.
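A rough sketch of that resolver in a crawler config (the class name and minFrequency syntax are based on my reading of the HTTP Collector 2.x documentation; the regular expressions and frequencies are placeholders to adapt):

```xml
<crawler id="my-crawler">
  <!-- Skip recrawling a document until its minimum frequency has elapsed -->
  <recrawlableResolver class="com.norconex.collector.http.recrawl.impl.GenericRecrawlableResolver">
    <!-- Recrawl news pages at most once a day... -->
    <minFrequency applyTo="reference" value="daily">.*/news/.*</minFrequency>
    <!-- ...and everything else at most once a week -->
    <minFrequency applyTo="reference" value="weekly">.*</minFrequency>
  </recrawlableResolver>
</crawler>
```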
I do have two separate collector configs, but they both use the same workDir. I assume that's going to be a problem?

Looking into GenericRecrawlableResolver...
You are correct. While many of the written files have unique names, it is highly recommended not to share the same workDir, to avoid possible collisions.
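In other words, give each configuration its own working directory, something along these lines (a sketch assuming the 2.x layout where workDir and maxDepth are set per crawler; ids, paths and depths are placeholders):

```xml
<!-- daily-shallow.xml -->
<crawler id="daily-shallow">
  <workDir>/opt/collector/workdir-daily</workDir>
  <maxDepth>2</maxDepth>
  <!-- ... -->
</crawler>

<!-- weekly-deep.xml -->
<crawler id="weekly-deep">
  <workDir>/opt/collector/workdir-weekly</workDir>
  <maxDepth>10</maxDepth>
  <!-- ... -->
</crawler>
```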
Let's say I crawl my site once, then notice that I crawled a section that I shouldn't have. So I add some exclusion lines to my reference filters.
When I crawl my site again, will the HTTP Collector notice that some references which were crawled before are now excluded and then send a delete command to my Solr? Or will I have to delete the accidentally crawled references myself?
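To illustrate, this is the kind of exclusion I mean (a sketch; the RegexReferenceFilter class is the one I understand ships with the 2.x core filters, and the URL pattern is just an example):

```xml
<referenceFilters>
  <!-- Drop references to the section that should not have been crawled -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    https://www\.example\.com/private/.*
  </filter>
</referenceFilters>
```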