Pittiplatsch closed this issue 5 years ago
Rejected documents are not considered orphans if they are still reachable during crawling. So I am afraid for now you will have to manually delete them from your Solr installation. There is a feature request for sending deletion requests to your committer for rejected documents here: https://github.com/Norconex/collector-http/issues/211
Unless what you are asking is different, I would close this issue in favor of the existing one.
Hi Pascal,
that's exactly what I am looking for. Luckily, my (new) filter is trivial enough to translate into a Solr deletion query.
I'll watch Norconex/collector-http#211 instead.
Thank you!
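For anyone doing the same manual clean-up, a minimal sketch of such a deletion request follows. It uses Solr's JSON delete-by-query update command. The host, the core name `mycore`, and the field `content_type` are assumptions made for illustration, so adjust the query to whatever your schema actually stores. The request is only built here, not sent.

```python
import json
import urllib.request


def build_delete_request(solr_base, core, query):
    """Build (but do not send) a Solr delete-by-query request."""
    url = f"{solr_base.rstrip('/')}/{core}/update?commit=true"
    # Solr's JSON update format: {"delete": {"query": "<query>"}}
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: core "mycore" and field "content_type" are assumptions.
req = build_delete_request(
    "http://localhost:8983/solr", "mycore", "content_type:image*"
)

# To actually run the deletion, uncomment the next line:
# urllib.request.urlopen(req)
```

Running the same query as a normal search first is a cheap way to check which documents it would remove.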
Hello Pascal,
thank you very much for your product, which you have obviously put so much effort into.
While refining my importer configuration after a few initial runs, I added a filter which effectively excludes images.
With this modification in place, after re-running the crawler I expected images which had been crawled during the first (unrestricted) runs to disappear from my (Solr) index.
However, a substantial number of images remained.
On investigation of my logs I stumbled on this part:
The first image actually got removed (as proposed in the log above), whilst the second one remained, although it should have been rejected by the same filter as well.
I suspect a problem with the priority of rejection reasons: the REJECTED_UNMODIFIED status prevents the filters from being executed at all, so the REJECTED_IMPORT reason is never recorded, which in turn means the intended orphan deletion is not triggered.
Could you please check that?
Aside: I obviously have the orphan strategy set to DELETE; as stated, the first image (captcha) was successfully deleted.
Thank you very much.
Kind regards, Lars
Edit: If I were a Java programmer, I'd just provide a failing unit test... ;-)
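In lieu of a real unit test, the suspected ordering can be modelled as a toy sketch. Nothing below is Norconex code: the `process` function, the checksum cache, and the `Status` enum are hypothetical and only borrow the two status names from this report. The sketch assumes the unmodified check runs before the import filter.

```python
from enum import Enum


class Status(Enum):
    NEW = "NEW"
    REJECTED_UNMODIFIED = "REJECTED_UNMODIFIED"
    REJECTED_IMPORT = "REJECTED_IMPORT"


def process(url, checksum, cache, reject_filter):
    """Toy pipeline: the checksum comparison runs before the import filter."""
    if cache.get(url) == checksum:
        # Short-circuit: the filter is never consulted for unchanged documents.
        return Status.REJECTED_UNMODIFIED
    cache[url] = checksum
    if reject_filter(url):
        return Status.REJECTED_IMPORT
    return Status.NEW


cache = {}

# First (unrestricted) run: no filter, both images get indexed.
no_filter = lambda url: False
first_a = process("a.png", "v1", cache, no_filter)
first_b = process("b.png", "v1", cache, no_filter)

# Second run with the new image filter in place.
image_filter = lambda url: url.endswith(".png")
second_a = process("a.png", "v2", cache, image_filter)  # content changed
second_b = process("b.png", "v1", cache, image_filter)  # content unchanged
```

In this model only the changed image reaches the filter and ends up as REJECTED_IMPORT; the unchanged one stops at REJECTED_UNMODIFIED, which would match one image being deleted while the other remains.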