DOMTagger only for html but also index PDF and other Documents

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Apache License 2.0

Hi,

Currently I want only the "main" content of an HTML document to be parsed, like this:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="main" toField="content" overwrite="true"/>
    <restrictTo field="document.contentType">text/html</restrictTo>
</tagger>

But this does not overwrite the document content. It only sets a metadata field named "content".

To upload this, I have to configure my CloudSearch committer to use this metadata field as the content:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
    <serviceEndpoint>XYZ</serviceEndpoint>
    <fixBadIds>true</fixBadIds>
    <sourceContentField>content</sourceContentField>
</committer>

So for HTML files I got this running. But what about PDF files and other documents? They still have their content in the content field and don't have any "content" metadata field. I was unable to find a "CopyContentToMetaField" config. Or is there a possibility to overwrite the content with the DOMTagger (or any other config)? The current behavior is that the content committed to CloudSearch for PDF files and other documents is empty.

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
    <pattern field="content">.*</pattern>
    <restrictTo field="document.contentType">^(?!text/html$).*$</restrictTo>
</tagger>
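
For reference, here is a rough sketch of how the two taggers from this thread might sit together in the importer section. It assumes an Importer 2.x style configuration with `preParseHandlers` and `postParseHandlers` (that placement is an assumption, not something stated above): the DOMTagger needs the raw HTML, so it would run before parsing, while the TextPatternTagger needs the extracted text of PDFs and other documents, so it would run after parsing.

```xml
<importer>
  <!-- Assumption: runs on the raw HTML, before text extraction. -->
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="main" toField="content" overwrite="true"/>
      <restrictTo field="document.contentType">text/html</restrictTo>
    </tagger>
  </preParseHandlers>
  <!-- Assumption: runs on the extracted text of every non-HTML document. -->
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="content">.*</pattern>
      <restrictTo field="document.contentType">^(?!text/html$).*$</restrictTo>
    </tagger>
  </postParseHandlers>
</importer>
```

With this layout, both HTML and non-HTML documents would end up with a "content" metadata field that the committer's `sourceContentField` can pick up. Whether `.*` captures multi-line content as a single value depends on the regex flags the tagger applies, which is worth verifying against your version.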

DOMTagger only for html but also index PDF and other Documents #441