Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

CDATA and regex #179

Closed AntonioAmore closed 8 years ago

AntonioAmore commented 8 years ago

I'd like to know whether it is possible to use CDATA or something similar in regex filters to escape & and other XML special characters that may appear in URLs.

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" field="document.reference" ... >someurl?param1=1&param2=2&param3=.*$</filter>

Replacing such symbols with HTML entities (for example, & with &amp;) makes the XML syntax correct, but breaks the regex semantics.

essiembre commented 8 years ago

Definitely! CDATA is part of the SGML/XML specs and is not specific to the HTTP Collector. You can use it in the body of just about any tag in any XML document. More info: https://en.wikipedia.org/wiki/CDATA
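
For illustration, the filter from the question could be wrapped in a CDATA section roughly as follows. This is a minimal sketch rather than a tested configuration: the `...` stands for the attributes elided in the original question, and the pattern is copied unchanged.

```xml
<!-- "..." marks the attributes elided in the original question; it is not valid XML as-is. -->
<!-- The regex sits inside <![CDATA[ ... ]]>, so the raw & characters need no escaping. -->
<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" field="document.reference" ... ><![CDATA[someurl?param1=1&param2=2&param3=.*$]]></filter>
```

The XML parser treats everything between `<![CDATA[` and `]]>` as literal character data, so `&` and `<` can be written as-is. The one sequence that cannot appear inside the section is `]]>` itself. CDATA only affects XML parsing, not regex syntax: `?` in the pattern above is still a regex quantifier, so matching a literal question mark would require `\?`.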

AntonioAmore commented 8 years ago

Thank you, it works perfectly.

essiembre commented 8 years ago

No problem!