Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Newbie Here! Need Help! How can I extract HTML tags in a website using Norconex? #322

Closed ryangabbbriel closed 7 years ago

ryangabbbriel commented 7 years ago

Can you provide me an example configuration? Thanks :)

essiembre commented 7 years ago

You have a few options, depending on what exactly you want to do. If you want to extract tag values into fields, one option is to use a DOMTagger in the <importer> section of your HTTP Collector configuration. It could look like this:

<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.firstName" toField="firstName" />
      <dom selector="div.lastName"  toField="lastName" />
    </tagger>
  </preParseHandlers>
</importer>

There are also other ways offered by the Importer module to extract text from files, like TextBetweenTagger and TextPatternTagger.
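
As a rough illustration of the TextBetweenTagger alternative, here is a minimal sketch that captures the text between an opening and a closing tag in the raw HTML. The element and attribute names follow the Importer 2.x documentation as best recalled and should be checked against the version in use; the start/end values and the target field name `firstName` are assumptions reused from the DOMTagger example above.

```xml
<importer>
  <preParseHandlers>
    <!-- Sketch only: verify element names against your Importer version. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
        inclusive="false" caseSensitive="false">
      <!-- "firstName" is the (assumed) field that receives the matched text. -->
      <textBetween name="firstName">
        <!-- Start and end markers are XML-escaped because they contain tags. -->
        <start>&lt;div class="firstName"&gt;</start>
        <end>&lt;/div&gt;</end>
      </textBetween>
    </tagger>
  </preParseHandlers>
</importer>
```

Because this handler works on text rather than on the parsed DOM, it is more sensitive to small markup differences (attribute order, whitespace) than a CSS selector would be.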

I invite you to have a look at the Importer configuration options.

akshaybijawe commented 7 years ago

Hi Pascal, I have a question about this DOMTagger implementation. This is the link that I want to crawl. This is a snippet from the HTML:

<div class="info-title">Sponsor:</div>
<div class="info-text" id="sponsor"> Rockefeller University </div>
<div class="info-title">Information provided by (Responsible Party):</div>
<div class="info-text">Dana Orange, Rockefeller University</div>

This is my config:

  <importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="div.info-title" toField="field1" />
            <dom selector="div.info-text"  toField="Value1" />
        </tagger>
    </preParseHandlers>
  </importer>

In the .meta file, the matched values are appended together in a single field, with | as a separator between them. My question is: how can I extract these fields separately? Thank you.
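
One direction that might address this, offered as an unconfirmed sketch rather than an answer given in this thread: make each selector narrow enough to match a single element, so each value lands in its own field. The example assumes DOMTagger accepts JSoup-style CSS selectors (id selectors, `:contains(...)`, and the adjacent-sibling combinator `+`); the field names `sponsor` and `responsibleParty` are hypothetical.

```xml
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <!-- Matches only the element with id="sponsor" from the snippet above. -->
      <dom selector="div#sponsor" toField="sponsor" />
      <!-- Assumed JSoup syntax: the info-text div that immediately follows
           the title div whose text contains the given label. -->
      <dom selector="div.info-title:contains(Information provided by) + div.info-text"
           toField="responsibleParty" />
    </tagger>
  </preParseHandlers>
</importer>
```

A broad selector such as `div.info-text` matches every such element on the page, which is consistent with several values ending up in the same field.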

essiembre commented 7 years ago

This ticket is closed; please create new tickets for new questions or issues. In your new ticket, please provide a sample of the .meta file you are getting.