Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Validation vs. Documentation? #362

Closed (liar666 closed 7 years ago)

liar666 commented 7 years ago

I originally wrote a crawler with:

       <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
               onMatch="include">
         https://(find|openresearch-repository|digitalcollections)[.]anu[.]edu[.]au/(handle/[0-9]+/1/)simple-search([?].*)?
       </filter>

Running collector-http.sh --checkcfg resulted in:

ERROR (XML Validation) RegexMetadataFilter: cvc-complex-type.2.3: Element 'filter' cannot have character [children], because the type's content type is element-only.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexMetadataFilter: cvc-complex-type.2.3: Element 'filter' cannot have character [children], because the type's content type is element-only.
ERROR (XML Validation) RegexMetadataFilter: cvc-complex-type.2.4.b: The content of element 'filter' is not complete. One of '{restrictTo, regex}' is expected.
...
WARN  [RegexMetadataFilter] Regular expression must now be in <regex> tag.

I'm not sure how to interpret the error message: the problematic line is not listed in the log, and the message itself is somewhat cryptic.

I changed my code to remove the "character children" with:

       <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
               onMatch="include">
         <regex>https://(find|openresearch-repository|digitalcollections)[.]anu[.]edu[.]au/(handle/[0-9]+/1/)simple-search([?].*)?</regex>
       </filter>

but now I get:

ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexReferenceFilter: cvc-complex-type.2.2: Element 'filter' must have no element [children], and the value must be valid.
ERROR (XML Validation) RegexReferenceFilter: cvc-minLength-valid: Value '' with length = '0' is not facet-valid with respect to minLength '1' for type 'nonEmptyString'.

According to the doc at: https://www.norconex.com/collectors/collector-core/latest/apidocs/com/norconex/collector/core/filter/impl/RegexReferenceFilter.html my first syntax was right (*).

Could you: 1) help me decipher the XML validation message so that I can understand and correct my error? 2) bring the docs and the XML validation code into agreement?

Thanks

(*) I don't like this first syntax because: 1) it is quite ambiguous: we don't really know where the boundaries of the regex are (are the leading/trailing spaces part of it?) 2) we cannot have multiple regexes, and thus have to repeat the parent XML node (<filter>) for each one => the crawler configuration is more verbose.

essiembre commented 7 years ago

The original validation error you tried to fix is not from RegexReferenceFilter. If you check carefully, it comes from RegexMetadataFilter in your importer settings.

So I would revert your reference filter to the way you had it and fix the other one.

liar666 commented 7 years ago

Sorry to continue this issue, but

1) in the doc: https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/RegexReferenceFilter.html in the "Usage example:" section, the <regex> tag is absent.

2) When I use:

       <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
               onMatch="include">
         <regex>http://libgen[.]io/scimag/index[.]php[?]s=&amp;journalid=9055&amp;v=128&amp;i=1</regex>
       </filter>

I get the following error:

ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexReferenceFilter: cvc-complex-type.2.2: Element 'filter' must have no element [children], and the value must be valid.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexReferenceFilter: cvc-minLength-valid: Value '' with length = '0' is not facet-valid with respect to minLength '1' for type 'nonEmptyString'.

Why?

essiembre commented 7 years ago

Good catch on the com.norconex.importer.handler.filter.impl.RegexReferenceFilter Javadoc. It has been fixed.

Your error likely comes not from the Importer module, but from com.norconex.collector.core.filter.impl.RegexReferenceFilter that you must also have configured somewhere. This can be confusing since the class names are the same (but different packages).

One needs the <regex> while the other one does not. We'll consider renaming one of these two in a future release to avoid further confusion.
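
To make the contrast concrete, here is a minimal sketch of the two forms as described in this thread. The URL pattern is a hypothetical placeholder, and the attributes shown are only those quoted above, so treat it as an illustration rather than a schema-checked configuration:

```xml
<!-- Collector-level filter (collector-core package):
     the pattern is the element's text, with no <regex> child. -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include">https://example[.]com/docs/.*</filter>

<!-- Importer-level filter (importer package):
     the pattern goes inside a <regex> child element. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
        onMatch="include">
  <regex>https://example[.]com/docs/.*</regex>
</filter>
```

Checking the package name in the validation message (collector.core vs. importer) should tell you which of the two forms the validator expects.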

liar666 commented 7 years ago

OK. If I sum up correctly:

1) com.norconex.collector.core.filter.impl.RegexReferenceFilter does NOT require <regex>
2) com.norconex.importer.handler.filter.impl.RegexReferenceFilter DOES require <regex>
3) com.norconex.importer.handler.filter.impl.RegexMetadataFilter DOES require <regex> as well

Why is there a difference between 1 and 2?

By the way, I imagine that using RegexMetadataFilter with field="document.reference" is the same as using RegexReferenceFilter?

essiembre commented 7 years ago

You are correct. There is no valid reason for the difference (regarding needing <regex>) other than that they were updated separately over time. They will probably be harmonized some day.

You are also correct about RegexMetadataFilter having the same effect as RegexReferenceFilter when filtering on document.reference. RegexReferenceFilter originally did not exist in the Importer module. It was added for convenience only since filtering on reference is more common.
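
As a rough sketch of that equivalence, the two importer filters below would be expected to behave alike. The URL pattern is a hypothetical placeholder, and the attribute layout follows what is quoted in this thread rather than a checked schema:

```xml
<!-- Convenience filter: matches against the document reference (URL). -->
<filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
        onMatch="include">
  <regex>https://example[.]com/docs/.*</regex>
</filter>

<!-- Same effect through the metadata filter,
     applied to the document.reference field. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.reference">
  <regex>https://example[.]com/docs/.*</regex>
</filter>
```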

The difference between the one configured at the Collector level and the one at the Importer level is that the first one filters URLs before documents are downloaded, whereas the one in the Importer filters them after download. The first one should be preferred when possible. The second one is useful when you want URLs to be extracted from a page but you do not want that page committed.
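
A minimal sketch of where each filter would sit is shown below. The surrounding element names (<crawler>, <referenceFilters>, <importer>, <preParseHandlers>), the crawler id, the onMatch="exclude" value, and the URL patterns are assumptions for illustration and are not taken from this thread, so check them against the documentation for your version:

```xml
<crawler id="example-crawler">

  <!-- Collector level: evaluated before a document is downloaded.
       URLs that do not match are never fetched. -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">https://example[.]com/.*</filter>
  </referenceFilters>

  <!-- Importer level: evaluated after download, once links on the page
       have been extracted. Matching listing pages are dropped
       instead of being committed. -->
  <importer>
    <preParseHandlers>
      <filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
              onMatch="exclude">
        <regex>https://example[.]com/listing/.*</regex>
      </filter>
    </preParseHandlers>
  </importer>

</crawler>
```

In this layout the listing pages are still crawled for their links, but only the pages they point to end up committed.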

liar666 commented 7 years ago

Fortunately, that's how I use them :)

Thanks a lot again.