Closed by 4rsood 4 years ago
One approach is to copy the HTML to a field before it gets parsed (i.e., with a pre-parse handler). Something like this could do it (not tested):
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <restrictTo field="document.contentType">text/html</restrictTo>
  <pattern field="doc_html">.*</pattern>
</tagger>
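For context, here is a minimal sketch of where such a handler could sit, assuming an Importer 2.x-style configuration in which pre-parse handlers are listed under `<preParseHandlers>` (nested in the crawler's `<importer>` section when used from a Collector). The surrounding elements are an assumption about your setup, and like the snippet above this is not tested:

```xml
<importer>
  <!-- Handlers listed here run on the raw document, before it is parsed to plain text. -->
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <!-- Only apply to HTML documents. -->
      <restrictTo field="document.contentType">text/html</restrictTo>
      <!-- Copy whatever the pattern matches into the doc_html field. -->
      <pattern field="doc_html">.*</pattern>
    </tagger>
  </preParseHandlers>
</importer>
```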
Thank you, Pascal. It worked: I am able to copy the entire HTML content of the document to a field. What if I want to copy the content into separate fields based on certain div tags? For example, I want to copy the entire HTML content of
<div class="findings"><h2>Findings</h2>Findings contents with html tags go here...</div>
into a separate field.
For this, DOMTagger is your friend. It will give you a lot of options.
I have been able to use DOMTagger to extract the contents, but they all come out as raw text.
This is the expected default behavior. Have a look at the class documentation for your "extract" options. To keep the HTML, you probably want to use html or outerHtml.
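As a rough illustration of the difference (a sketch, not tested): applied to the `div.findings` example above, `html` should keep only the markup inside the selected element, while `outerHtml` should also keep the enclosing `<div>` tag itself. The target field names `findings_inner` and `findings_outer` are made up for this example, and the tagger needs to run as a pre-parse handler so that the HTML is still available:

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Inner markup only, e.g. <h2>Findings</h2>Findings contents ... -->
  <dom selector="div.findings" toField="findings_inner" extract="html" />
  <!-- Same match, including the wrapping <div class="findings"> ... </div> -->
  <dom selector="div.findings" toField="findings_outer" extract="outerHtml" />
</tagger>
```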
It worked with the following configuration:
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="div.evaluationfindings"
       toField="evaluationfindingshtml"
       extract="html" />
</tagger>
Thanks again!
Hello, I am wondering whether there is a way to retain all the HTML formatting while indexing the body contents of a web page.