Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories such as search engines.
A second milestone release of the HTTP Collector was just made. One of the most significant features comes from the Importer module and allows you to better control the Importer handler flow via XML configuration.
It aims to replace the <restrictTo ...> options applied to each handler and the Importer "filter" handlers. It is better explained with a before/after example from the <importer> section of a configuration file...
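As a rough sketch of what such a before/after pair might look like: the handler and condition class names, the `<condition>`, `<fieldMatcher>` and `<valueMatcher>` elements, and every field name and value below are illustrative assumptions, not copied from the release. Check them against the Importer documentation for your version.

```xml
<!-- Before (sketch): every handler carries its own restrictTo.
     Class names, matcher elements and values are assumptions. -->
<importer>
  <preParseHandlers>
    <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <restrictTo>
        <fieldMatcher>document.contentType</fieldMatcher>
        <valueMatcher>text/html</valueMatcher>
      </restrictTo>
      <constant name="format">web page</constant>
    </handler>
    <handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <restrictTo>
        <fieldMatcher>document.contentType</fieldMatcher>
        <valueMatcher>text/html</valueMatcher>
      </restrictTo>
      <fieldMatcher>X-Powered-By</fieldMatcher>
    </handler>
  </preParseHandlers>
</importer>

<!-- After (sketch): one condition, stated once, guards both handlers. -->
<importer>
  <preParseHandlers>
    <if>
      <condition class="com.norconex.importer.handler.condition.impl.TextCondition">
        <fieldMatcher>document.contentType</fieldMatcher>
        <valueMatcher>text/html</valueMatcher>
      </condition>
      <then>
        <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
          <constant name="format">web page</constant>
        </handler>
        <handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
          <fieldMatcher>X-Powered-By</fieldMatcher>
        </handler>
      </then>
    </if>
  </preParseHandlers>
</importer>
```

The two fragments are alternative versions of the same `<importer>` section, not a single file.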
For very simple use cases it can be a bit more verbose, but you can see it allows grouping multiple handlers under the same condition. There are several other limitations to "restrictTo" that are being addressed by this new approach. In addition to the above example...
Multiple conditions can be grouped under a <conditions operator="[AND|OR]"> tag (with AND, all of them must match; with OR, any one of them must match).
Text matching (TextMatcher) is still the default condition, but conditions are no longer limited to it; other condition types can be used as well.
Negations are much easier to perform thanks to <if> vs <ifNot> as well as <then> vs <else>.
Conditions can be nested.
The <then> or <else> clauses can contain a mix of nested conditions and handlers.
Filter handlers are no longer necessary thanks to using conditions along with the special <reject/> tag.
Here is a more complex example to better illustrate the above options:
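One way these options might combine is sketched below. Only the `<if>`, `<ifNot>`, `<conditions operator="...">`, `<then>`, `<else>` and `<reject/>` tags come from the description above. The `<condition>` and `<handler>` class names, the matcher elements, the `method="regex"` attribute, and all field names and values are assumptions chosen for illustration.

```xml
<importer>
  <preParseHandlers>

    <!-- Negation plus grouping: reject anything that is neither HTML
         nor PDF. This plays the role a filter handler used to play. -->
    <ifNot>
      <conditions operator="OR">
        <condition class="com.norconex.importer.handler.condition.impl.TextCondition">
          <fieldMatcher>document.contentType</fieldMatcher>
          <valueMatcher>text/html</valueMatcher>
        </condition>
        <condition class="com.norconex.importer.handler.condition.impl.TextCondition">
          <fieldMatcher>document.contentType</fieldMatcher>
          <valueMatcher>application/pdf</valueMatcher>
        </condition>
      </conditions>
      <then>
        <reject/>
      </then>
    </ifNot>

    <!-- then/else branches, with a nested condition mixed in
         alongside a regular handler. -->
    <if>
      <condition class="com.norconex.importer.handler.condition.impl.TextCondition">
        <fieldMatcher>document.contentType</fieldMatcher>
        <valueMatcher>text/html</valueMatcher>
      </condition>
      <then>
        <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
          <constant name="format">html</constant>
        </handler>
        <if>
          <condition class="com.norconex.importer.handler.condition.impl.TextCondition">
            <fieldMatcher>document.reference</fieldMatcher>
            <valueMatcher method="regex">.*/blog/.*</valueMatcher>
          </condition>
          <then>
            <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
              <constant name="section">blog</constant>
            </handler>
          </then>
          <else>
            <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
              <constant name="section">other</constant>
            </handler>
          </else>
        </if>
      </then>
      <else>
        <handler class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
          <constant name="format">pdf</constant>
        </handler>
      </else>
    </if>

  </preParseHandlers>
</importer>
```

In this sketch the rejection rule is stated once, up front, and the tagging logic that follows only has to deal with the two content types that survive it.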
Open questions:
Given that the above is a lot more flexible, is there value in keeping <restrictTo> around for Importer handlers? We are considering deprecating it.
For the same reason, we are also considering deprecating the filter handlers in favor of conditions + <reject/>. Any value in keeping them around in the long run?
What's next?
This should be the last milestone release before the release candidate. To help us release it sooner, here is how you can help:
Try out this new milestone, and more specifically the new Importer XML-flow configuration options. Give your feedback and report issues (make sure to mention this version).
Documentation is still being put together. Share the areas where you struggled the most, or that you feel deserve more documentation and examples.
Happy crawling!