Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Feature request for com.norconex.importer.handler.tagger.impl.DOMTagger #28

Closed by liar666 8 years ago

liar666 commented 8 years ago

Following your suggestion in #26, I'm opening this new ticket to suggest adding a "fromField" to the DOMTagger, so that the user could split the original page into pieces with a first DOMTagger (or any other Tagger that permits such a thing) and then use other DOMTaggers to "parse" each piece separately. This would allow a "recursive" parsing of the page, for instance to extract multiple people within a single organization. Clearly, this tree structure of Taggers might result in a tree structure for the generated tags. As for how to represent the resulting tags:

  1. First, to keep the configuration as generic as possible, I'm not sure you should enforce any pattern in the generated tags. Maybe the user applies this "Divide&Conquer" technique to make the crawler easier to write or to speed things up, but still wants a "flat" structure of tags?
  2. In case you want to enforce a tree structure in the generated tags, I think that "representing the hierarchical structure safely" depends on whether you know the final structure beforehand. If so, a naming scheme based on a pre-, post-, or in-order traversal of the tree (possibly with numbered nodes to ensure the names are unique) would suffice. I believe this is the case, since the structure of the configuration file gives you the required information: the final tag hierarchy is strongly related to the taggers' hierarchy. But I'm not sure how easy it would be to extract this information.
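
To make the naming idea in point 2 concrete, here is a minimal sketch of pre-order naming with sibling numbering. The `TagNaming` and `Node` classes and the dotted name format are hypothetical and only illustrate the technique; nothing here is part of the Importer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TagNaming {
    /** Hypothetical node of a tagger tree: a field name plus nested taggers. */
    static final class Node {
        final String field;
        final List<Node> children = new ArrayList<>();

        Node(String field) {
            this.field = field;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Pre-order traversal: name the node first, then its children left to right. */
    static List<String> preOrderNames(Node root) {
        List<String> out = new ArrayList<>();
        visit(root, "", 0, out);
        return out;
    }

    private static void visit(Node node, String parentPath, int index, List<String> out) {
        // The sibling index keeps names unique even when field names repeat.
        String path = parentPath.isEmpty()
                ? node.field
                : parentPath + "." + index + "." + node.field;
        out.add(path);
        for (int i = 0; i < node.children.size(); i++) {
            visit(node.children.get(i), path, i, out);
        }
    }

    public static void main(String[] args) {
        Node org = new Node("organization")
                .add(new Node("person").add(new Node("firstName")).add(new Node("lastName")))
                .add(new Node("person").add(new Node("firstName")));
        for (String name : preOrderNames(org)) {
            System.out.println(name);
        }
    }
}
```

With two "person" children under one "organization", the second person's first name comes out as `organization.1.person.0.firstName`, so repeated field names never collide.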

Another "sub-feature" that would be great is the ability to reuse a Tagger on several pieces. For instance, the user could split a page into x pieces and run the same Tagger on each piece. I can imagine doing this either by making "fromField" a list, or by assigning an id to a DOMTagger instance and referencing that instance in several places in the configuration file. However, in the latter case, I'm not sure we can guarantee that there is no deadlock or infinite loop among the Taggers' references.
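
On the infinite-loop worry: if references between tagger ids were ever supported, a loop could be detected when the configuration is loaded with a standard depth-first search. The sketch below assumes a hypothetical map from each tagger id to the ids it references; the `TaggerRefCheck` class and the ids are invented for illustration and do not exist in the Importer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TaggerRefCheck {
    /** Returns true if following references from some id leads back to that id. */
    static boolean hasCycle(Map<String, List<String>> refs) {
        Set<String> done = new HashSet<>();   // ids fully explored, known to be safe
        Set<String> onPath = new HashSet<>(); // ids on the current DFS path
        for (String id : refs.keySet()) {
            if (visit(id, refs, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String id, Map<String, List<String>> refs,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(id)) {
            return true;  // reached an id that is already on the current path
        }
        if (done.contains(id)) {
            return false; // already explored without finding a loop
        }
        onPath.add(id);
        for (String next : refs.getOrDefault(id, Collections.<String>emptyList())) {
            if (visit(next, refs, done, onPath)) {
                return true;
            }
        }
        onPath.remove(id);
        done.add(id);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("splitContacts", Arrays.asList("names", "phones"));
        refs.put("names", Collections.<String>emptyList());
        System.out.println(hasCycle(refs)); // no loop in this configuration

        refs.put("names", Arrays.asList("splitContacts"));
        System.out.println(hasCycle(refs)); // "names" now points back to "splitContacts"
    }
}
```

A check like this only rejects bad configurations at load time; it says nothing about how the referenced taggers would actually be executed.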

essiembre commented 8 years ago

Optional support for a "fromField" has been added to the DOMTagger, along with support for a default value when there is no match. Both are available in the latest importer snapshot.

This enables the logic you wanted to achieve in #26. For example, take this HTML:

<html>
<body>
  <div class="contact">
    <div class="firstName">JoeFirstOnly</div>
  </div>
  <div class="contact">
    <div class="firstName">John</div>
    <div class="lastName">Smith</div>
  </div>
  <div class="contact">
    <div class="lastName">JackLastOnly</div>
  </div>
</body>
</html>

If you use these tagger configs...

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="div.contact" toField="htmlContacts" extract="html" />
  </tagger>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
          fromField="htmlContacts">
      <dom selector="div.firstName" toField="firstNames" defaultValue="NO_FIRST_NAME" />
      <dom selector="div.lastName"  toField="lastNames" defaultValue="NO_LAST_NAME" />
  </tagger>

... you will end up with these field values:

firstNames = "JoeFirstOnly", "John",  "NO_FIRST_NAME"
lastNames  = "NO_LAST_NAME", "Smith", "JackLastOnly"

Thanks to the defaultValue filling the blanks, both fields always hold the same number of values, so you can use the index order of the values to rebuild contact names.
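
As a rough sketch of that last step, the snippet below pairs the two value lists by index and drops the placeholders. It assumes you have already read the `firstNames` and `lastNames` values into plain Java lists (how you do that depends on where in your pipeline you consume the metadata); the `ContactNames` class is hypothetical, and the placeholder strings are the defaultValue settings from the config above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContactNames {
    // Placeholders copied from the defaultValue attributes in the tagger config.
    static final String NO_FIRST = "NO_FIRST_NAME";
    static final String NO_LAST = "NO_LAST_NAME";

    /** Pairs firstNames.get(i) with lastNames.get(i), leaving out placeholders. */
    static List<String> rebuild(List<String> firstNames, List<String> lastNames) {
        if (firstNames.size() != lastNames.size()) {
            throw new IllegalArgumentException("Both fields must hold the same number of values");
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < firstNames.size(); i++) {
            String first = NO_FIRST.equals(firstNames.get(i)) ? "" : firstNames.get(i);
            String last = NO_LAST.equals(lastNames.get(i)) ? "" : lastNames.get(i);
            names.add((first + " " + last).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> firstNames = Arrays.asList("JoeFirstOnly", "John", "NO_FIRST_NAME");
        List<String> lastNames = Arrays.asList("NO_LAST_NAME", "Smith", "JackLastOnly");
        System.out.println(rebuild(firstNames, lastNames));
        // prints [JoeFirstOnly, John Smith, JackLastOnly]
    }
}
```

The length check matters: if a defaultValue were missing from either `<dom>` entry, the lists could end up with different lengths and the index pairing would silently mismatch contacts.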

Let me know if that works for you.

essiembre commented 8 years ago

This feature is now part of 2.6.0.