Closed ryangabbbriel closed 7 years ago
You have a few options, depending on exactly what you want to do. If you want to extract tag values into fields, one option is to use a DOMTagger in the <importer>
section of your HTTP Collector configuration. It could look like this:
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.firstName" toField="firstName" />
      <dom selector="div.lastName" toField="lastName" />
    </tagger>
  </preParseHandlers>
</importer>
The Importer module also offers other ways to extract text from files, such as TextBetweenTagger and TextPatternTagger.
I invite you to have a look at the Importer configuration options.
Hi Pascal, I have a question about this DOMTagger implementation. This is the link that I want to crawl. This is a snippet from the HTML:
<div class="info-title">Sponsor:</div>
<div class="info-text" id="sponsor"> Rockefeller University </div>
<div class="info-title">Information provided by (Responsible Party):</div>
<div class="info-text">Dana Orange, Rockefeller University</div>
This is my config:
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.info-title" toField="field1" />
      <dom selector="div.info-text" toField="Value1" />
    </tagger>
  </preParseHandlers>
</importer>
In the .meta file, I get the values appended together, with | as a separator between them. Now my question is: how can I extract these fields separately? Thank you.
This ticket is closed; please create new tickets for new questions/issues. In your new ticket, please provide a sample of the .meta file you are getting.
Can you provide me an example configuration? Thanks :)
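For reference, here is a minimal sketch of one way to get separate fields, based only on the HTML snippet quoted above. The selector div.info-text matches every div with that class, so all matching values end up in the same field. Narrowing each selector so that it matches a single element should give one value per field. This assumes DOMTagger accepts jsoup-style CSS selectors (ID selectors, :contains(), and the adjacent-sibling combinator +), and that the real page has the same structure as the snippet. The field names sponsor and responsibleParty are hypothetical; pick whatever names suit your index.

```xml
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <!-- In the snippet the sponsor div carries id="sponsor",
           so an ID selector should match that element only. -->
      <dom selector="div#sponsor" toField="sponsor" />
      <!-- Assumption: the value div directly follows its title div.
           Find the title by its text, then take the next sibling. -->
      <dom selector="div.info-title:contains(Responsible Party) + div.info-text"
           toField="responsibleParty" />
    </tagger>
  </preParseHandlers>
</importer>
```

If the value divs on the actual page have no id and do not directly follow their titles, the selectors would need adjusting. Checking the resulting .meta file after a test crawl is the quickest way to confirm what each selector matched.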