Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Logical combination of filters #460

Closed by FuePi 6 years ago

FuePi commented 6 years ago

We need to filter out documents that lack the Content-Length header and have the Transfer-Encoding header set to chunked. This is the importer configuration I came up with:

<importer>
    <preParseHandlers>
        <!-- Exclude pages with header Transfer-Encoding: chunked
             and header Content-Length missing from being saved to the cache -->
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="exclude" field="transfer-encoding">
          .*chunked$
        </filter>
        <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter"
                onMatch="exclude" fields="content-length" />
    </preParseHandlers>
</importer>

But this seems to exclude documents matching either one of the filters (a logical OR), whereas we only want to exclude documents matching both. Is there a way to logically combine these filters?

essiembre commented 6 years ago

Good question. I am not sure you can do it the way you have it, but here is a workaround I suggest you try (untested):

<filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter"
        onMatch="exclude" fields="content-length">
    <restrictTo field="transfer-encoding">.*chunked$</restrictTo>
</filter>

The restrictTo ensures EmptyMetadataFilter is applied only to documents whose transfer-encoding matches the "chunked" pattern, so a document is excluded only when both conditions hold.

For more control, you can also have a look at the ScriptFilter.
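As a rough, untested sketch of what the ScriptFilter route might look like for this case: the snippet below assumes the JavaScript engine is available, that the script is given a `metadata` variable with a `getString(...)` method, and that the value of the last expression decides whether the document matches. These details are assumptions recalled from the ScriptFilter documentation, so check them against the version you run before relying on this.

```xml
<!-- Sketch only (untested). Assumes ScriptFilter exposes "metadata" to the
     script and uses the last expression's value as the match result. -->
<filter class="com.norconex.importer.handler.filter.impl.ScriptFilter"
        engineName="JavaScript" onMatch="exclude">
  <script><![CDATA[
    // Header names are lowercase here to mirror the filters above.
    var te = metadata.getString('transfer-encoding');
    var cl = metadata.getString('content-length');

    // Condition 1: Transfer-Encoding ends with "chunked".
    var chunked = te != null && /chunked$/.test(String(te));
    // Condition 2: Content-Length is missing or blank.
    var noLength = cl == null || String(cl).trim() === '';

    // Match (and therefore exclude) only when both conditions hold.
    chunked && noLength;
  ]]></script>
</filter>
```

Compared with restrictTo, a script makes the AND explicit and leaves room for other combinations (OR, NOT, more than two headers), at the cost of being harder to read than plain configuration.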

Does that work for you?

FuePi commented 6 years ago

That works perfectly! Thank you very much for the fast response.