Closed liar666 closed 7 years ago
The original validation error you tried to fix is not from RegexReferenceFilter. If you check carefully, it comes from RegexMetadataFilter in your importer settings.
So I would revert your reference filter to the way you had it and fix the other one.
Sorry to continue this issue, but
1) in the doc:
https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/filter/impl/RegexReferenceFilter.html
in the "Usage example:" section, the <regex> tag is absent.
2) When I use:
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
onMatch="include">
<regex>http://libgen[.]io/scimag/index[.]php[?]s=&amp;journalid=9055&amp;v=128&amp;i=1</regex>
</filter>
I get the following error:
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexReferenceFilter: cvc-complex-type.2.2: Element 'filter' must have no element [children], and the value must be valid.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) RegexReferenceFilter: cvc-minLength-valid: Value '' with length = '0' is not facet-valid with respect to minLength '1' for type 'nonEmptyString
Why?
Good catch on the com.norconex.importer.handler.filter.impl.RegexReferenceFilter javadoc documentation. It has been fixed.
Your error likely comes not from the Importer module, but from com.norconex.collector.core.filter.impl.RegexReferenceFilter, which you must also have configured somewhere. This can be confusing since the class names are the same (but the packages differ).
One needs the <regex> tag while the other one does not. We'll consider renaming one of these two in a future release to avoid further confusion.
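As a minimal sketch of the two syntaxes described above (the example.com pattern is a hypothetical placeholder, and attributes other than class and onMatch are left out), the two same-named filters would be written roughly like this:

```xml
<!-- Collector-level filter: the regular expression is the element's text content. -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
    onMatch="include">https?://example[.]com/articles/.*</filter>

<!-- Importer-level filter: the regular expression goes in a nested <regex> tag. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
    onMatch="include">
  <regex>https?://example[.]com/articles/.*</regex>
</filter>
```

In either form, a literal & inside a URL pattern has to be written as &amp;amp; because the configuration is XML.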
OK, if I sum up correctly:
1- com.norconex.collector.core.filter.impl.RegexReferenceFilter does NOT require <regex>
2- com.norconex.importer.handler.filter.impl.RegexReferenceFilter DOES require <regex>
3- com.norconex.importer.handler.filter.impl.RegexMetadataFilter DOES require <regex> as well
Why is there a difference between 1 and 2?
By the way, I imagine that using RegexMetadataFilter with field="document.reference" is the same as using RegexReferenceFilter?
You are correct. There is no valid reason for the difference (regarding needing <regex>) other than that they were updated separately over time, and they will probably be harmonized some day.
You are also correct about RegexMetadataFilter having the same effect as RegexReferenceFilter when filtering on document.reference. RegexReferenceFilter originally did not exist in the Importer module. It was added for convenience only since filtering on reference is more common.
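To illustrate that equivalence, here is a hedged sketch of the two Importer-level forms. The pattern is a made-up placeholder, and onMatch="exclude" is used only for the sake of the example:

```xml
<!-- Importer-level convenience filter on the document reference. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
    onMatch="exclude">
  <regex>.*/index[.]php.*</regex>
</filter>

<!-- Same effect, per the explanation above, using the generic metadata filter. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
    onMatch="exclude" field="document.reference">
  <regex>.*/index[.]php.*</regex>
</filter>
```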
The difference between the one configured at the Collector level and the one at the Importer level... is that the first one filters URLs before documents are downloaded, whereas the one in the importer does it after. The first one should be preferred when possible. The second one is useful when you want URLs to be extracted from a page but you do not want that page committed.
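A rough sketch of where each filter would sit in a crawler configuration follows. The enclosing element names (referenceFilters, importer, preParseHandlers) assume the 2.x configuration layout, and the crawler id and example.com patterns are hypothetical, so check them against your own version:

```xml
<crawler id="example-crawler">
  <!-- Collector level: applied to URLs before documents are downloaded. -->
  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include">https?://example[.]com/.*</filter>
  </referenceFilters>

  <!-- Importer level: applied after download, so links on listing pages
       can still be extracted while the listing pages themselves are dropped. -->
  <importer>
    <preParseHandlers>
      <filter class="com.norconex.importer.handler.filter.impl.RegexReferenceFilter"
          onMatch="exclude">
        <regex>https?://example[.]com/listing/.*</regex>
      </filter>
    </preParseHandlers>
  </importer>
</crawler>
```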
Fortunately, that's how I use them :)
Thanks a lot again.
I originally wrote a crawler with:
Running collector-http.sh --checkcfg resulted in:
I'm not sure how to interpret the error message, both because the problematic line is not listed in the log and because the message is somewhat cryptic.
I changed my code to remove the "character children" with:
but now I get:
According to the doc at: https://www.norconex.com/collectors/collector-core/latest/apidocs/com/norconex/collector/core/filter/impl/RegexReferenceFilter.html my first syntax was right (*).
Could you: 1) help me decipher the XML validation message so that I can understand and correct my error? 2) bring the docs and the XML validation code into agreement?
Thanks
(*) I don't like this first syntax because: 1) it is quite ambiguous: we don't really know where the boundaries of the regex are (are the leading/trailing spaces part of it?) 2) we cannot have multiple regexes, and thus have to repeat the parent XML node (<filter>) => crawler code is more verbose.