Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Order of filter rejection reasons #85

Closed Pittiplatsch closed 5 years ago

Pittiplatsch commented 5 years ago

Hello Pascal,

thank you very much for your product, which you have obviously put so much effort into 👍

While refining my importer configuration after a few initial runs, I added a filter which effectively excludes images.

With this modification in place, after re-running the crawler I expected images which had been crawled during the first (unrestricted) runs to disappear from my (Solr) index.

However, a substantial number of images remained.

While investigating my logs, I came across this part:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://__removed__/Captcha8.png?hash=12345
INFO  [CrawlerEventManager]           REJECTED_IMPORT: https://__removed__/Captcha8.png?hash=12345 (ImporterResponse[reference=https://__removed__/Captcha8.png?hash=12345,status=ImporterStatus[status=REJECTED,filter=<null>,exception=<null>,description=None of the filters with onMatch being INCLUDE got matched.],doc=<null>,nestedResponses=[]])
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://__removed__/abc205.jpg
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: https://__removed__/abc205.jpg

The first image actually got removed (as the log above indicates), whilst the second one remained, although it should have been rejected by the same filter as well.
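
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that lists image URLs reported as REJECTED_UNMODIFIED, i.e. candidates that may still sit in the index. It assumes log lines shaped like the excerpt above. The class name, method name, and list of image extensions are made up for illustration and are not part of Norconex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmodifiedImageScan {
    // Matches lines such as
    // "INFO  [CrawlerEventManager]   REJECTED_UNMODIFIED: https://host/abc205.jpg"
    // and captures the URL (assumption: the URL contains no whitespace).
    private static final Pattern EVENT = Pattern.compile(
            "\\[CrawlerEventManager\\]\\s+REJECTED_UNMODIFIED:\\s+(\\S+)");

    // Assumed set of image extensions; an optional query string may follow.
    private static final Pattern IMAGE = Pattern.compile(
            "(?i).*\\.(png|jpe?g|gif)(\\?.*)?");

    // Returns the URLs of all REJECTED_UNMODIFIED events that look like images.
    static List<String> unmodifiedImages(List<String> logLines) {
        List<String> urls = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EVENT.matcher(line);
            if (m.find() && IMAGE.matcher(m.group(1)).matches()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }
}
```

Fed the lines of a crawler log, `unmodifiedImages` returns only the references that were skipped as unmodified and end in one of the assumed image extensions.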

I suspect a problem with the order in which rejection reasons are evaluated: the REJECTED_UNMODIFIED status seems to prevent the filters from being executed at all, so no REJECTED_IMPORT is recorded, which in turn means the intended orphan deletion is never triggered.

Could you please check that?

Aside: I obviously have the orphan strategy set to DELETE; as stated, the first image (captcha) was successfully deleted.

Thank you very much. Kind regards, Lars

Edit: If I were a Java programmer, I'd just provide a failing unit test... ;-)

essiembre commented 5 years ago

Rejected documents are not considered orphans if they are still reachable during crawling. So I am afraid for now you will have to manually delete them from your Solr installation. There is a feature request for sending deletion requests to your committer for rejected documents here: https://github.com/Norconex/collector-http/issues/211

Unless what you are asking is different, I would close this issue in favor of the existing one.

Pittiplatsch commented 5 years ago

Hi Pascal,

that's exactly what I am looking for. Luckily, my (new) filter is trivial enough to translate into a Solr deletion query.
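
As a rough illustration of such a deletion query, the sketch below builds the JSON body of a Solr delete-by-query command that targets documents whose reference ends in an image extension. Several things here are assumptions, not taken from this thread: that the document URL is stored in a non-tokenized string field named `id`, the extension list, and the class and method names. It only builds the payload and does not contact Solr.

```java
import java.util.List;

class SolrImageDelete {
    // Builds a Lucene regexp query such as  id:/.*\.(png|jpg)/
    // The regexp must match the whole field value, hence the leading ".*".
    // The extensions are assumed to be plain alphanumeric strings.
    static String regexQuery(String field, List<String> extensions) {
        return field + ":/.*\\.(" + String.join("|", extensions) + ")/";
    }

    // Wraps the query in Solr's JSON delete-by-query command,
    // escaping backslashes and double quotes for JSON.
    static String deletePayload(String query) {
        String escaped = query.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"delete\":{\"query\":\"" + escaped + "\"}}";
    }
}
```

The resulting string would be POSTed with content type `application/json` to the collection's `/update` handler, followed by a commit. Note that this particular pattern does not cover URLs with a trailing query string (such as the captcha example), so the regexp would need extending for those, and it is worth running the same query as a plain search first to see what it would delete.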

I'll watch Norconex/collector-http#211 instead.

Thank you 👍