Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

3.0.0-M2 release: ready for testing #760

essiembre closed 2 years ago

essiembre commented 3 years ago

A second milestone release of the HTTP Collector was just made. One of the most significant features comes from the Importer module and allows you to better control the Importer handler flow via XML configuration.

This feature aims to replace both the <restrictTo ...> options applied to each handler and the Importer "filter" handlers. It is best explained with a before/after example from the <importer> section of a configuration file:

Before:

    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <restrictTo field="document.reference">.*/hr/.*</restrictTo>
      <constant name="department">Human Resources</constant>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <restrictTo field="document.reference">.*/hr/.*</restrictTo>
      <dom selector=".contact .name" toField="fullName" />
    </tagger>

After:

    <if>
      <condition class="ReferenceCondition">
        <valueMatcher>.*/hr/.*</valueMatcher>
      </condition>
      <then>
        <handler class="ConstantTagger">
          <constant name="department">Human Resources</constant>
        </handler>
        <handler class="DOMTagger">
          <dom selector=".contact .name" toField="fullName" />
        </handler>
      </then>
    </if>

For very simple use cases, the new syntax can be a bit more verbose, but it lets you group multiple handlers under the same condition instead of repeating the restriction on each one. This new approach also addresses several other limitations of "restrictTo". In addition to the above example...

Here is a more complex example to better illustrate the above options:

      <if>
        <conditions operator="AND">
          <condition class="ScriptCondition">
            metadata.getString('document.contentFamily') === 'html';
          </condition>
          <condition class="NumericCondition">
            <fieldMatcher>Content-Length</fieldMatcher>
            <valueMatcher operator="GE" number="10000"/>
            <valueMatcher operator="LT" number="20000"/>
          </condition>
        </conditions>
        <then>
          <handler class="ConstantTagger">
            <constant name="htmlDocSize">10K-20K</constant>
          </handler>
        </then>
        <else>
          <ifNot>
            <condition>
              <fieldMatcher>document.contentType</fieldMatcher>
              <valueMatcher method="wildcard">text/*</valueMatcher>
            </condition>
            <then>
              <reject/>
            </then>
          </ifNot>
        </else>
      </if>
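
For comparison, here is a smaller sketch of the `<else>` branch applied to the first example. It only reuses elements already shown above and assumes that `<else>` accepts handlers directly, the same way `<then>` does. The "General" value is a hypothetical default, not something taken from a real configuration.

```xml
<if>
  <condition class="ReferenceCondition">
    <valueMatcher>.*/hr/.*</valueMatcher>
  </condition>
  <then>
    <!-- Applied only to documents whose reference matches .*/hr/.* -->
    <handler class="ConstantTagger">
      <constant name="department">Human Resources</constant>
    </handler>
  </then>
  <else>
    <!-- Assumption: every other document gets a hypothetical default value. -->
    <handler class="ConstantTagger">
      <constant name="department">General</constant>
    </handler>
  </else>
</if>
```

With `<restrictTo>`, the same fallback would likely need a second handler carrying the negated pattern, so the condition would be written twice.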

Open questions:

What's next?

This should be the last milestone release before the release candidate. To help us release it sooner, here is how you can help:

Happy crawling!
