Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Importer pipeline error with custom IDocumentFilter #552

Closed: RaduCiumag closed this issue 4 years ago

RaduCiumag commented 5 years ago

Hello,

I am trying to use a custom document filter in the pre-process stage of the Importer pipeline. My document filter implements only the IDocumentFilter interface. When the pipeline is active and a document goes through the pre-process filters, an error is raised and processing for that page stops. The error:

INFO  - REJECTED_ERROR             -            REJECTED_ERROR: https://a.ro/somehtml
INFO  - AbstractCrawler            - homezz-crawler: Could not process document: https://a.ro/some.html (ro...filter.XDocumentFilter cannot be cast to com.norconex.importer.handler.filter.IOnMatchFilter)

The source code I identified as the cause of the problem is at: .m2/repository/com/norconex/collectors/norconex-importer/2.9.0/norconex-importer-2.9.0-sources.jar!/com/norconex/importer/Importer.java:354

boolean accepted = acceptDocument(doc, filter, parsed);
if (isMatchIncludeFilter((IOnMatchFilter) h)) {
    includeResolver.hasIncludes = true;
    if (accepted) {

The cause: the cast is performed without first checking whether h is an instance of IOnMatchFilter, so a filter that implements only IDocumentFilter fails with a ClassCastException.
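To illustrate the kind of check I mean, here is a minimal, self-contained sketch. The types below are hypothetical stand-ins that I wrote only to mirror the names in the error message; they are not the real Importer classes, and I am assuming that IOnMatchFilter exposes something like getOnMatch(). The point is only that an instanceof test before the cast lets a plain IDocumentFilter pass through without an exception. How this should actually be fixed in Importer.java is of course up to the maintainers.

```java
public class CastGuardSketch {

    // Hypothetical stand-ins for the Importer types named in the error message.
    // The real interfaces live in com.norconex.importer.handler.filter and may differ.
    enum OnMatch { INCLUDE, EXCLUDE }

    interface IDocumentFilter {
        boolean acceptDocument(String content);
    }

    interface IOnMatchFilter {
        OnMatch getOnMatch(); // assumed accessor, for illustration only
    }

    // A filter like the one in this report: it implements IDocumentFilter only.
    static class PlainFilter implements IDocumentFilter {
        @Override
        public boolean acceptDocument(String content) {
            return !content.isEmpty();
        }
    }

    // A filter that also declares its on-match behaviour.
    static class IncludeFilter implements IDocumentFilter, IOnMatchFilter {
        @Override
        public boolean acceptDocument(String content) {
            return content.contains("keep");
        }

        @Override
        public OnMatch getOnMatch() {
            return OnMatch.INCLUDE;
        }
    }

    // Guarded check: test the type first, and cast only when the test succeeds.
    // A handler that is not an IOnMatchFilter is simply not an include filter.
    static boolean isMatchIncludeFilter(Object handler) {
        return handler instanceof IOnMatchFilter
                && ((IOnMatchFilter) handler).getOnMatch() == OnMatch.INCLUDE;
    }

    public static void main(String[] args) {
        IDocumentFilter[] handlers = { new PlainFilter(), new IncludeFilter() };
        for (IDocumentFilter h : handlers) {
            // No ClassCastException here, even for PlainFilter.
            System.out.println(h.getClass().getSimpleName()
                    + " is include filter: " + isMatchIncludeFilter(h));
        }
    }
}
```

As a possible workaround on my side until a fix is available, I could presumably make my filter implement IOnMatchFilter as well, but I have not verified that against the 2.9.0 API.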

Can you please review this?

Thank you.

essiembre commented 5 years ago

A new snapshot release has been made with this fix. Please test and confirm.