Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Deleting orphan entries from my index #779

Closed (cjurewicz closed 2 years ago)

cjurewicz commented 2 years ago

I am running Elasticsearch 7.12.0 and Norconex 2.9.0. Some web pages in my index no longer exist; they now return a 404 error. The problem is that the index entries for these 404 pages are not being deleted. To fix this, I decided to run a full crawl by deleting my crawlstore. As I understand it, the crawlstore determines whether a page should be marked for deletion. But because the new crawlstore was created by a full crawl, it does not know about these 404 entries and therefore never marks them for deletion. What am I missing, and how do I remedy this? Any help would be greatly appreciated.

essiembre commented 2 years ago

You are correct that if you wipe out the crawl store, it loses track of whether any URL previously existed or not and can't send a deletion request. It is the same as doing a brand new crawl.

With version 2.x you'll have to use workarounds (if you want to avoid custom coding). Obviously, cleaning your index would do it, but that is rather drastic. Another approach when doing full crawls (i.e., wiping the crawl store) is to also index a date field holding the current date (see CurrentDateTagger). You can also define a made-up constant field to do the same, one that you update when doing full crawls (see ConstantTagger). Then you can rely on this new date/constant field to delete from your index anything older or not having your new constant.
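
As a rough illustration of the cleanup step described above, the following Python sketch builds the request bodies one could POST to Elasticsearch's `_delete_by_query` endpoint after a full crawl. The field names `crawl_date` and `crawl_id` and the value `crawl-2022-01` are hypothetical placeholders for whatever you configure in CurrentDateTagger or ConstantTagger. The sketch also assumes the date field is mapped as a date in a format Elasticsearch can compare against an ISO 8601 timestamp, and that the constant field is indexed as an exact (keyword) value. It only builds and prints the JSON; sending it is left to your client of choice.

```python
import json
from datetime import datetime, timezone


def stale_by_date_query(field, crawl_start):
    """Match documents whose crawl-date field is older than this crawl's start.

    Documents that lack the field entirely are NOT matched by a range query.
    """
    return {"query": {"range": {field: {"lt": crawl_start.isoformat()}}}}


def stale_by_constant_query(field, current_value):
    """Match documents that do not carry the current crawl's constant.

    This also matches documents that have no such field at all.
    """
    return {"query": {"bool": {"must_not": {"term": {field: current_value}}}}}


if __name__ == "__main__":
    # Hypothetical values: record the start time just before launching the full crawl.
    start = datetime(2022, 1, 15, 8, 0, tzinfo=timezone.utc)
    print(json.dumps(stale_by_date_query("crawl_date", start), indent=2))
    print(json.dumps(stale_by_constant_query("crawl_id", "crawl-2022-01"), indent=2))
```

Either body would be sent as `POST /<your-index>/_delete_by_query` once the full crawl has finished and committed. Running the same query through `_search` first is a cheap way to check what would be removed.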

Version 3 introduced the ability to send deletion requests for any triggered crawler events (REJECTED_NOTFOUND in your case) regardless of the crawl store. When you are ready to upgrade, have a look at DeleteRejectedEventListener.

cjurewicz commented 2 years ago

Thank you Pierre for your quick response and insightful wisdom. It is very much appreciated!

essiembre commented 2 years ago

Did you mean "Pascal"? 😉

cjurewicz commented 2 years ago

Sorry Pascal. I blame fat fingers and predictive text!