You are correct that if you wipe out the crawl store, it loses track of whether any URL previously existed or not and can't send a deletion request. It is the same as doing a brand new crawl.
With version 2.x you'll have to use workarounds (short of custom coding). Obviously, cleaning your index would do it, but that is rather drastic. Another approach when doing full crawls (i.e., wiping the crawl store) is to also index a date field holding the current date (see CurrentDateTagger). You can also define a made-up constant field to the same effect, updating its value whenever you do a full crawl (see ConstantTagger). You can then rely on this new date/constant field to delete from your index anything older than, or not matching, your latest crawl. A configuration sketch follows below.
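As a rough sketch of what that could look like in a 2.x configuration (the field names `crawl_date` and `crawl_run`, the date format, and the constant value below are made up for illustration and should be adapted to your setup):

```xml
<!-- Inside the <importer> section of the 2.x crawler configuration. -->
<importer>
  <postParseHandlers>
    <!-- Stamps every committed document with the date/time of the current crawl run. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
            field="crawl_date" format="yyyy-MM-dd'T'HH:mm:ss" />
    <!-- Alternatively (or in addition), stamps a constant value you change before each full crawl. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="crawl_run">full-crawl-2021-06</constant>
    </tagger>
  </postParseHandlers>
</importer>
```

Once the full crawl has finished, a delete-by-query against your Elasticsearch index for documents with an older `crawl_date` (or a `crawl_run` value other than the current one) would remove the stale entries, including the 404 pages.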
Version 3 introduced the ability to send deletion requests for any triggered crawler events (REJECTED_NOTFOUND in your case), regardless of the crawl store. When you're ready to upgrade, have a look at DeleteRejectedEventListener.
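When that time comes, a minimal sketch of the 3.x crawler configuration might look like this (the exact package name and the list of events to match are assumptions to verify against the 3.x documentation):

```xml
<!-- In the 3.x crawler configuration: sends a deletion request to the committer(s)
     for documents triggering the listed rejection events, such as 404s. -->
<eventListeners>
  <listener class="com.norconex.collector.core.crawler.event.impl.DeleteRejectedEventListener">
    <eventMatcher method="csv">REJECTED_NOTFOUND,REJECTED_BAD_STATUS</eventMatcher>
  </listener>
</eventListeners>
```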
Thank you Pierre for your quick response and insightful wisdom. It is very much appreciated!
Did you mean "Pascal"? 😉
Sorry Pascal. I blame fat fingers and predictive text!
I am running Elasticsearch 7.12.0 and Norconex 2.9.0. I have some web pages in my index that no longer exist on the site; they now return a 404 error. The problem is that the index entries for these 404 pages are not being deleted. To fix this, I decided to run a full crawl by deleting my crawl store. As I understand it, the crawl store determines whether a page should be marked for deletion. But because the crawl store was recreated from scratch by the full crawl, it does not know about these 404 entries and as a result never marks them for deletion. What am I missing, and how do I remedy this? Any help would be greatly appreciated.