Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
RenameTagger only works with collector.http.* fields #12

Closed kalhomoud closed 11 years ago

kalhomoud commented 11 years ago

Hello,

For some reason, RenameTagger only works when the field name starts with collector.http.*, such as collector.http.MIMETYPE. It did not work for me when the field name was "support_url" or "dc:title".

Here is how I have it set up: . . . .

. . . .

Please let me know if you need my config to reproduce the issue.

Thanks, Khalid

essiembre commented 11 years ago

This one works as expected. For the crawler to know about a document's metadata, it has to parse the document first. If you simply change "preParseHandlers" to "postParseHandlers", it will work. The reason "some" metadata is available to pre-parse handlers is that whatever the crawler could find in the HTTP headers or while extracting URLs is added as extra metadata. To make sure that extra metadata does not get mixed up with the actual document metadata once the document is parsed, it is prefixed with "collector.http.".
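
To illustrate the ordering described above, here is a minimal Python sketch. It is a toy model only, not Norconex code: `rename_field`, `run_pipeline`, `parse`, and all field values are hypothetical names and data invented for the example. It assumes a rename simply does nothing when the source field is not yet present.

```python
# Toy model of pre-parse vs. post-parse handlers. Not the Norconex API;
# every function name and value below is a hypothetical illustration.

def rename_field(metadata, from_field, to_field):
    """Rename a field in place; return True only if the field existed."""
    if from_field not in metadata:
        return False
    metadata[to_field] = metadata.pop(from_field)
    return True


def run_pipeline(crawler_metadata, parse, pre_parse_handlers, post_parse_handlers):
    """Run pre-parse handlers, then parse, then post-parse handlers."""
    metadata = dict(crawler_metadata)  # copy so each run starts clean
    results = {"pre": [], "post": []}
    for handler in pre_parse_handlers:
        results["pre"].append(handler(metadata))
    # Document-level fields only become available at this point.
    metadata.update(parse())
    for handler in post_parse_handlers:
        results["post"].append(handler(metadata))
    return metadata, results


# Metadata gathered by the crawler before parsing (made-up value).
crawler_metadata = {"collector.http.MIMETYPE": "text/html"}


def parse():
    # Fields that exist only after the document body is parsed (made-up values).
    return {"dc:title": "Example title", "support_url": "http://example.com/support"}


def rename_title(md):
    return rename_field(md, "dc:title", "title")


def rename_mime(md):
    return rename_field(md, "collector.http.MIMETYPE", "mimetype")


# Pre-parse: the dc:title rename finds nothing, the collector.http.* one succeeds.
pre_metadata, pre_results = run_pipeline(
    crawler_metadata, parse, [rename_title, rename_mime], []
)

# Post-parse: dc:title is now present, so the rename succeeds.
post_metadata, post_results = run_pipeline(
    crawler_metadata, parse, [], [rename_title]
)
```

In this model, the pre-parse run leaves "dc:title" untouched because the field is added only after the handler has already run, which mirrors the symptom reported in this issue.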