Closed liar666 closed 8 years ago
Optional support for a "fromField" attribute has been added to the DOMTagger, along with support for a default value when there is no match. Both are available in the latest importer snapshot.
This enables the logic you wanted to achieve in #26. For example, take this HTML:
<html>
<body>
<div class="contact">
<div class="firstName">JoeFirstOnly</div>
</div>
<div class="contact">
<div class="firstName">John</div>
<div class="lastName">Smith</div>
</div>
<div class="contact">
<div class="lastName">JackLastOnly</div>
</div>
</body>
</html>
If you use these tagger configs...
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
<dom selector="div.contact" toField="htmlContacts" extract="html" />
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"
fromField="htmlContacts">
<dom selector="div.firstName" toField="firstNames" defaultValue="NO_FIRST_NAME" />
<dom selector="div.lastName" toField="lastNames" defaultValue="NO_LAST_NAME" />
</tagger>
... you will end up with these field values:
firstNames = "JoeFirstOnly", "John", "NO_FIRST_NAME"
lastNames = "NO_LAST_NAME", "Smith", "JackLastOnly"
Thanks to the defaultValue filling in the blanks, you can then use the index order of the values to rebuild contact names.
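As a rough illustration of that last step, here is a minimal Java sketch that pairs the two value lists by index. The class and method names (ContactNames, rebuild) are hypothetical and not part of the importer. It assumes you have already read firstNames and lastNames into two lists of equal length, and that the placeholders match the defaultValue attributes used in the config above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Norconex Importer API.
final class ContactNames {
    // Placeholders assumed to match the defaultValue attributes in the tagger config.
    static final String NO_FIRST = "NO_FIRST_NAME";
    static final String NO_LAST = "NO_LAST_NAME";

    // Pairs first and last names by index; a placeholder is treated as "missing".
    static List<String> rebuild(List<String> firstNames, List<String> lastNames) {
        if (firstNames.size() != lastNames.size()) {
            throw new IllegalArgumentException(
                    "Both fields must hold the same number of values.");
        }
        List<String> contacts = new ArrayList<>();
        for (int i = 0; i < firstNames.size(); i++) {
            String first = NO_FIRST.equals(firstNames.get(i)) ? "" : firstNames.get(i);
            String last = NO_LAST.equals(lastNames.get(i)) ? "" : lastNames.get(i);
            // trim() drops the separating space when one side is missing.
            contacts.add((first + " " + last).trim());
        }
        return contacts;
    }

    public static void main(String[] args) {
        List<String> firstNames = Arrays.asList("JoeFirstOnly", "John", "NO_FIRST_NAME");
        List<String> lastNames = Arrays.asList("NO_LAST_NAME", "Smith", "JackLastOnly");
        // prints [JoeFirstOnly, John Smith, JackLastOnly]
        System.out.println(rebuild(firstNames, lastNames));
    }
}
```

The size check matters: the index pairing only holds if every contact block produced exactly one value in each field, which is what the default values are there to ensure.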
Let me know if that works for you.
This feature is now part of 2.6.0.
Following your suggestion in #26, I'm opening this new ticket to suggest adding a "fromField" to the DOMTagger, so that the user could split the original page into pieces with a first DOMTagger (or any other Tagger that permits such a thing) and then use other DOMTaggers to "parse" each piece separately. This would allow a "recursive" parsing of the page, for instance to extract multiple people within a single organization. Clearly, this tree structure of Taggers might result in a tree structure for the generated tags. As for how to represent the resulting tags:
Another "sub-feature" that would be great is the ability to reuse a Tagger on various pieces. For instance, the user could split a page into x pieces and run the same Tagger on each piece. I can imagine doing this either by making "fromField" a list or by assigning an id to a DOMTagger instance and referencing that instance in several places in the configuration file. However, in the latter case, I'm not sure we can guarantee there is no blocking or infinite loop among the Taggers' references.
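To make the infinite-loop concern concrete: if Taggers could reference each other by id, a configuration loader could reject circular references up front with a depth-first search. The sketch below is purely hypothetical. No such id/reference mechanism exists in the importer as far as this ticket describes, and the names (TaggerRefCheck, hasCycle) and the map-of-ids representation are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical check, assuming each tagger id maps to the ids it references.
final class TaggerRefCheck {

    // Returns true if following references from any id leads back to itself.
    static boolean hasCycle(Map<String, List<String>> refs) {
        Set<String> done = new HashSet<>();       // ids fully explored, known safe
        Set<String> inProgress = new HashSet<>(); // ids on the current path
        for (String id : refs.keySet()) {
            if (visit(id, refs, inProgress, done)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String id, Map<String, List<String>> refs,
            Set<String> inProgress, Set<String> done) {
        if (done.contains(id)) {
            return false;
        }
        if (!inProgress.add(id)) {
            return true; // id is already on the current path: circular reference
        }
        for (String next : refs.getOrDefault(id, Collections.emptyList())) {
            if (visit(next, refs, inProgress, done)) {
                return true;
            }
        }
        inProgress.remove(id);
        done.add(id);
        return false;
    }
}
```

With a check like this run once at configuration load time, references that form a tree or a shared-subtree graph would be accepted, and only genuinely circular ones would be refused.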