Order of filter rejection reasons

Pittiplatsch commented 5 years ago

Hello Pascal,

thank you very much for your product, which you have obviously put so much effort into 👍

While refining my importer configuration after a few initial runs, I added a filter that effectively excludes images.

With this modification in place, after re-running the crawler I expected images which had been crawled during the first (unrestricted) runs to disappear from my (Solr) index.

However, a substantial number of images remained.

While investigating my logs, I stumbled on this part:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]           REJECTED_IMPORT: https://__removed__/Captcha8.png?hash=12345 (ImporterResponse[reference=https://__removed__/Captcha8.png?hash=12345,status=ImporterStatus[status=REJECTED,filter=<null>,exception=<null>,description=None of the filters with onMatch being INCLUDE got matched.],doc=<null>,nestedResponses=[]])
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: https://__removed__/abc205.jpg

The first image actually got removed (as the log above suggests), whilst the second one remained, although it should have been rejected by the same filter as well.

I suspect a problem with the priority of rejection reasons: the REJECTED_UNMODIFIED status seems to prevent the filters from being executed at all, so the REJECTED_IMPORT reason is never raised, which in turn means the intended orphan deletion is not triggered.

Could you please check that?

Aside: I obviously have the orphan strategy set to DELETE; as stated, the first image (captcha) was successfully deleted.

Thank you very much. Kind regards, Lars

Edit: If I were a Java programmer, I'd just provide a failing unit test... ;-)
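
For illustration, the suspected ordering can be modelled in a few lines of Java. This is a toy sketch, not Norconex code: the class, the method names, the `IMPORTED` outcome and the `example.com` URLs are all hypothetical, and only the two rejection names are taken from the log excerpt. It shows how a document flagged as unmodified would never reach the filter if the unmodified check ran first, whereas running the filter first would yield REJECTED_IMPORT for the same document.

```java
import java.util.Locale;

public class RejectionOrderSketch {

    // REJECTED_* names come from the log excerpt; IMPORTED is a hypothetical
    // placeholder for "passed every check".
    enum Outcome { REJECTED_UNMODIFIED, REJECTED_IMPORT, IMPORTED }

    // Hypothetical stand-in for the new filter: rejects .png/.jpg URLs,
    // ignoring any query string such as "?hash=12345".
    static boolean isImage(String url) {
        int q = url.indexOf('?');
        String path = (q >= 0 ? url.substring(0, q) : url).toLowerCase(Locale.ROOT);
        return path.endsWith(".png") || path.endsWith(".jpg");
    }

    // Suspected ordering: the unmodified check short-circuits the filter.
    static Outcome unmodifiedCheckFirst(String url, boolean unmodified) {
        if (unmodified) {
            return Outcome.REJECTED_UNMODIFIED; // filter is never consulted
        }
        return isImage(url) ? Outcome.REJECTED_IMPORT : Outcome.IMPORTED;
    }

    // Alternative ordering: the filter runs before the unmodified check.
    static Outcome filterFirst(String url, boolean unmodified) {
        if (isImage(url)) {
            return Outcome.REJECTED_IMPORT;
        }
        return unmodified ? Outcome.REJECTED_UNMODIFIED : Outcome.IMPORTED;
    }

    public static void main(String[] args) {
        String unchangedImage = "https://example.com/abc205.jpg";
        System.out.println(unmodifiedCheckFirst(unchangedImage, true));
        System.out.println(filterFirst(unchangedImage, true));
    }
}
```

In this model the unchanged image ends up with a different rejection reason depending solely on the order of the two checks, which matches the behaviour described above.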

essiembre commented 5 years ago

Rejected documents are not considered orphans if they are still reachable during crawling. So I am afraid for now you will have to manually delete them from your Solr installation. There is a feature request for sending deletion requests to your committer for rejected documents here: https://github.com/Norconex/collector-http/issues/211

Unless what you are asking is different, I would close this issue in favor of the existing one.

Pittiplatsch commented 5 years ago

Hi Pascal,

that's exactly what I am looking for. Luckily, my (new) filter is trivial enough to translate into a Solr delete query.

I'll watch Norconex/collector-http#211 instead.

Thank you 👍
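
As a rough sketch of such a manual cleanup, the snippet below builds a delete-by-query body in the JSON form accepted by Solr's update handler. Several things here are assumptions rather than details from this thread: that the document URL is stored in a non-tokenized field named `id`, that the filter only concerns `.png` and `.jpg` files, and the class and method names themselves. The regular expression has to match the whole field value, so it also allows an optional query string.

```java
public class SolrDeleteQuerySketch {

    /**
     * Builds a Lucene regexp query matching URLs that end in one of the given
     * extensions, optionally followed by a query string.
     * For ("id", "png", "jpg") the result is: id:/.*\.(png|jpg)(\?.*)?/
     */
    static String imageDeleteQuery(String field, String... extensions) {
        return field + ":/.*\\.(" + String.join("|", extensions) + ")(\\?.*)?/";
    }

    /** Wraps a query in a JSON delete-by-query body, escaping \ and ". */
    static String deleteBody(String query) {
        String escaped = query.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"delete\":{\"query\":\"" + escaped + "\"}}";
    }

    public static void main(String[] args) {
        // Field name "id" is an assumption; adjust it to the actual schema.
        String query = imageDeleteQuery("id", "png", "jpg");
        System.out.println(deleteBody(query));
    }
}
```

The printed body would be posted to the collection's update endpoint and followed by a commit. Running the same query as a plain select first is a cheap way to confirm that it matches only the intended documents.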

Norconex / importer

Order of filter rejection reasons #85