Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem, and for sending it to various data repositories such as search engines.
I am not sure this is what you mean, but you can set keepReferrerData to true on your link extractor to keep information about the link that led to a document, as metadata.
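Something along these lines should do it. This is only a sketch: it assumes a 2.x-style crawler configuration and the GenericLinkExtractor class, so adjust the class name to whichever link extractor you actually have configured. Only the keepReferrerData attribute is the part that matters here.

```xml
<!-- Goes inside your <crawler> configuration.
     The extractor class below is an assumption; keep the one you already use. -->
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"
      keepReferrerData="true" />
</linkExtractors>
```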
The target document will then have the following metadata fields added (taken from Javadoc):
Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpMetadata.COLLECTOR_REFERRER_REFERENCE (collector.referrer-reference).
Referrer link tag: The tag and attribute names of the link that contained the document reference (URL) in referrer's content. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TAG (collector.referrer-link-tag).
Referrer link text: The text between the opening and closing tags of the link in the referrer document. Can be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TEXT (collector.referrer-link-text).
Referrer link title: The title attribute of the link that contained the document reference (URL) in referrer's content. Can also be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TITLE (collector.referrer-link-title).
Is this what you are after? You can rename or copy the collector.referrer-link-title field to whatever name suits you best, for instance with a RenameTagger (part of the Importer module).
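For the rename, a sketch like the following might work in the importer section of your crawler configuration. It assumes the Importer 2.x RenameTagger syntax, and the target field name linkTitle is purely illustrative; pick whatever name your repository expects.

```xml
<importer>
  <postParseHandlers>
    <!-- Renames the referrer link title field.
         "linkTitle" is a hypothetical target name, not something the collector defines. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <rename fromField="collector.referrer-link-title" toField="linkTitle" />
    </tagger>
  </postParseHandlers>
</importer>
```

If you would rather keep the original field as well, a copy-style tagger in place of the rename would be the thing to look for in the Importer documentation.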