Newbie Here! Need Help! How can I extract HTML tags from a website using Norconex?

ryangabbbriel commented 7 years ago

Can you provide me an example configuration? Thanks :)

essiembre commented 7 years ago

You have a few options, depending on what exactly you want to do. If you want to extract tag values into fields, one option is to use a DOMTagger in the <importer> section of your HTTP Collector configuration. It could look like this:

<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.firstName" toField="firstName" />
      <dom selector="div.lastName"  toField="lastName" />
    </tagger>
  </preParseHandlers>
</importer>

There are also other ways offered by the Importer module to extract text from files, like TextBetweenTagger and TextPatternTagger.
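
As a rough illustration of those two alternatives, here is a minimal sketch. The element and attribute names follow the Importer 2.x documentation as best recalled, so please verify them against the version you run. The field names (pageHeading, emails) and both patterns are hypothetical and only for illustration. Note that markup characters inside the configuration have to be XML-escaped.

```xml
<importer>
  <preParseHandlers>
    <!-- Assumed 2.x syntax: stores the text found between a start
         and an end marker in the "pageHeading" field (hypothetical name). -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
            inclusive="false" caseSensitive="false">
      <textBetween name="pageHeading">
        <start>&lt;h1&gt;</start>
        <end>&lt;/h1&gt;</end>
      </textBetween>
    </tagger>
    <!-- Assumed 2.x syntax: stores every regular expression match
         in the "emails" field (hypothetical name and pattern). -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger"
            caseSensitive="false">
      <pattern field="emails">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}</pattern>
    </tagger>
  </preParseHandlers>
</importer>
```

DOMTagger is usually the better fit when the value sits in a well-defined element, while these two work on the raw text and can help when the markup is irregular.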

I invite you to have a look at the Importer configuration options.

akshaybijawe commented 7 years ago

Hi Pascal, I have a question about this DOMTagger implementation. This is the link that I want to crawl. This is a snippet from the HTML:

<div class="info-title">Sponsor:</div>
<div class="info-text" id="sponsor"> Rockefeller University </div>
<div class="info-title">Information provided by (Responsible Party):</div>
<div class="info-text">Dana Orange, Rockefeller University</div>

This is my config:

  <importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="div.info-title" toField="field1" />
            <dom selector="div.info-text"  toField="Value1" />
        </tagger>
    </preParseHandlers>
  </importer>

In the .meta file, I get the values appended together, with | as a separator between them. Now my question is: how can I extract these fields separately? Thank you.

essiembre commented 7 years ago

This ticket is closed; please create a new ticket for each new question or issue. In your new ticket, please provide a sample of the .meta file you are getting.
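
For anyone landing here with the same question, one possible direction (not confirmed in this thread) is to make each selector match a single element, since div.info-title and div.info-text both match several elements in the snippet above. The sketch below assumes DOMTagger accepts jsoup-style CSS selectors, including the id, :contains() and adjacent-sibling forms. The field names sponsor and responsibleParty are hypothetical.

```xml
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <!-- Targets the one element carrying id="sponsor" in the snippet. -->
      <dom selector="div#sponsor" toField="sponsor" />
      <!-- Assumption: picks the info-text div that immediately follows
           the title containing "Information provided by". -->
      <dom selector="div.info-title:contains(Information provided by) + div.info-text"
           toField="responsibleParty" />
    </tagger>
  </preParseHandlers>
</importer>
```

With selectors this specific, each field should receive a single value rather than one entry per matching div.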
