Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is it possible to save the href title as part of the document? #195

Closed: bruce-genhot closed this issue 8 years ago

essiembre commented 8 years ago

I am not sure this is what you mean, but you can set keepReferrerData to true on your link extractor. The collector will then keep information about the link that led to each document and store it as metadata on that document. E.g.:

<extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor"
    keepReferrerData="true" />

The target document will then have the following metadata fields added (taken from Javadoc):

Is this what you are after? You can rename or copy the collector.referrer-link-title field to whatever name suits you best, for instance with a RenameTagger (part of the Importer module).
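For illustration, the rename could look roughly like the fragment below. This is a minimal sketch that assumes the Importer 2.x configuration syntax. The target field name linkTitle is made up for the example, and the tagger is assumed to sit among your importer handlers (for instance under postParseHandlers). Check the RenameTagger Javadoc for your version before relying on the attribute names.

```xml
<!-- Sketch only: "linkTitle" is a hypothetical target field name. -->
<!-- Assumed placement: inside <importer><postParseHandlers> ... -->
<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
  <!-- Move the referring link's title attribute to a friendlier field. -->
  <rename fromField="collector.referrer-link-title"
          toField="linkTitle" overwrite="true" />
</tagger>
```

If you would rather keep the original field as well, a copy-style tagger from the same Importer module would be the alternative to a rename.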

bruce-genhot commented 8 years ago

Thanks, it works for me.

essiembre commented 8 years ago

Thanks for confirming.