Is it possible to save the href title as part of the document? #195

essiembre commented 8 years ago

I am not sure this is what you mean, but you can set keepReferrerData to true on your link extractor to keep information about the link that led to a document, stored as metadata. E.g.:

<extractor class="com.norconex.collector.http.url.impl.TikaLinkExtractor"
    keepReferrerData="true" />

The target document will then have the following metadata fields added (taken from Javadoc):

Referrer reference: The reference (URL) of the page where the link to a document was found. Metadata value is HttpMetadata.COLLECTOR_REFERRER_REFERENCE (collector.referrer-reference).
Referrer link tag: The tag and attribute names of the link that contained the document reference (URL) in referrer's content. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TAG (collector.referrer-link-tag).
Referrer link text: The text between the tags of the referrer document. Can be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TEXT (collector.referrer-link-text).
Referrer link title: The title attribute of the link that contained the document reference (URL) in referrer's content. Can also be useful to help establish better document titles. Metadata value is HttpMetadata.COLLECTOR_REFERRER_LINK_TITLE (collector.referrer-link-title).

Is this what you are after? You can rename or copy the collector.referrer-link-title field to whatever best suits you, for instance with a RenameTagger (part of the Importer module).
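As a rough sketch of the rename mentioned above, something like the following could go in the importer section of the crawler configuration. It assumes the Importer 2.x XML syntax for RenameTagger. The target field name "title", the overwrite setting, and the placement under postParseHandlers are illustrative assumptions, so check them against the Javadoc for your version:

```xml
<importer>
  <postParseHandlers>
    <!-- Assumed Importer 2.x syntax; "title" is a hypothetical target field. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
      <!-- Move the referrer link title into the field used as the document title. -->
      <rename fromField="collector.referrer-link-title"
              toField="title" overwrite="true" />
    </tagger>
  </postParseHandlers>
</importer>
```

With overwrite="true", a title already extracted from the document itself would be replaced. Leave it out if the link title should only fill in when no title exists.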

bruce-genhot commented 8 years ago

Thanks, it works for me.

essiembre commented 8 years ago

Thanks for confirming.
