Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question - Retain html tags while indexing #702

Closed 4rsood closed 4 years ago

4rsood commented 4 years ago

Hello, wondering if there is a way I can retain all the html formatting while indexing the body contents of a web page.

essiembre commented 4 years ago

One approach is to copy the HTML to a field before the HTML gets parsed (i.e., as a pre-parse handler). Something like this could do it (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">.*</pattern>
  </tagger>
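
For context, here is a minimal sketch of where such a tagger would sit, assuming the Importer 2.x configuration layout in which pre-parse handlers are declared under `<preParseHandlers>` inside `<importer>`. The surrounding elements are an assumption about the rest of the configuration; only the tagger itself comes from the snippet above.

```xml
<importer>
  <preParseHandlers>
    <!-- Declared here so it runs on the raw HTML, before parsing removes the markup. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">.*</pattern>
    </tagger>
  </preParseHandlers>
</importer>
```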

4rsood commented 4 years ago

Thank you Pascal. It worked - I am able to copy the entire HTML contents of the document to a field. What if I want to copy the contents into separate fields based on certain div tags? For example, I want to copy the entire HTML contents of

<div class="findings"><h2>Findings</h2>Findings contents with html tags go here...</div>

in a separate field.

essiembre commented 4 years ago

For this, DOMTagger is your friend. It will give you a lot of options.

4rsood commented 4 years ago

I have been able to use DOMTagger and extract contents but they all come out as raw text.

essiembre commented 4 years ago

This is the expected default behavior. Have a look at the class documentation for your "extract" options. To keep the HTML, you probably want to use html or outerHtml.
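
As a rough illustration of the difference between the two options, here is an untested sketch against the `div.findings` example from earlier in the thread. The field names `findings_html` and `findings_outer_html` are hypothetical, chosen only for this example.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- "html": inner HTML of the matched element, without the surrounding div tags -->
  <dom selector="div.findings" toField="findings_html" extract="html" />
  <!-- "outerHtml": like "html", but also includes the matched div element itself -->
  <dom selector="div.findings" toField="findings_outer_html" extract="outerHtml" />
</tagger>
```

With the sample markup, `findings_html` should hold roughly the `<h2>` heading plus the text that follows it, while `findings_outer_html` should also include the enclosing `<div class="findings">` tags. Leaving out `extract` falls back to plain text, which matches the raw-text output reported above.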

4rsood commented 4 years ago

It worked with the following configuration:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div.evaluationfindings"
         toField="evaluationfindingshtml"
         extract="html" />
</tagger>

Thanks again!