Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

New DOMTagger's "defaultValue" not working? #38

Closed: liar666 closed this issue 8 years ago

liar666 commented 8 years ago

Hi,

I'm trying to use the new feature that assigns a default value when no match is found, but I can't seem to get it to work.

I've got a general tagger that extracts the part of the page listing the members of an organization:

<tagger>
  ...
  <dom selector="div.founders" toField="ORG_MEMBERS"
       overwrite="true"
       extract="outerHtml" />
  ...
</tagger>

Then another tagger extracts information for each member:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" fromField="ORG_MEMBERS">
  <dom selector="div ul li div h5:last-of-type" toField="EXP_NAME"
       overwrite="true"
       extract="ownText" />
  <dom selector="div ul li div h5:nth-of-type(1) a img" toField="EXP_IMAGE"
       overwrite="true"
       defaultValue="no-image"
       extract="attr(src)" />
</tagger>

Unfortunately, in the .meta file I get:

EXP_IMAGE=https://d1qb2nb5cznatu.cloudfront.net/users/89122-medium_jpg?1405520924^|~https://d1qb2nb5cznatu.cloudfront.net/users/1512896-medium_jpg?1441125521
...
EXP_NAME=Krishnan Menon^|~Christian Sutardi^|~Marshall Utoyo^|~Srinivas Sista^|~Filippo Lombardi

So there are 5 names, but only 2 images... None of which is "no-image". Did I make a mistake in my code or is there a bug in the DOMTagger?

FYI, I've just updated the libs to the latest ones available on the website, but my code still does not work...

PS: the source page is: http://500.co/startup/fabelio-2/

essiembre commented 8 years ago

The default value is used only when the selector has no match at all. In your case there are matches, which is why you do not see the default value.

Remember that the dom elements are independent of each other. The sample in ticket #28 shows how what I think you want to accomplish can sometimes be done: if you can first isolate each group of HTML code as a distinct element (storing each in a multi-value field), your logic can then be applied to each group.
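
To make the difference concrete, here is a minimal sketch in plain Java. It is not Norconex or jsoup code: the class and method names are hypothetical, and the nested lists only model "the image URLs matched inside each member's HTML fragment". It contrasts pooling all matches for the whole document (where the default is used only if nothing matched anywhere) with deciding the fallback per isolated group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not part of Norconex Importer.
// Each inner list holds the values one group (one member) matched.
class DefaultValueSketch {

    // Whole-document selection: all matches are pooled, and the default
    // is used only when the pooled result is empty.
    static List<String> selectAcrossDocument(List<List<String>> groups, String defaultValue) {
        List<String> values = new ArrayList<>();
        for (List<String> group : groups) {
            values.addAll(group);
        }
        if (values.isEmpty()) {
            values.add(defaultValue);
        }
        return values;
    }

    // Per-group selection: each group is checked on its own, so a group
    // without a match contributes the default value.
    static List<String> selectPerGroup(List<List<String>> groups, String defaultValue) {
        List<String> values = new ArrayList<>();
        for (List<String> group : groups) {
            if (group.isEmpty()) {
                values.add(defaultValue);
            } else {
                values.addAll(group);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Five members, only two of which have an image (made-up file names).
        List<List<String>> members = List.of(
            List.of("img-1.jpg"),
            List.<String>of(),
            List.<String>of(),
            List.of("img-2.jpg"),
            List.<String>of());

        // Prints [img-1.jpg, img-2.jpg]
        System.out.println(selectAcrossDocument(members, "no-image"));
        // Prints [img-1.jpg, no-image, no-image, img-2.jpg, no-image]
        System.out.println(selectPerGroup(members, "no-image"));
    }
}
```

The first result mirrors the symptom reported above: five names but only two image values, none of them the default. The second is what isolating each member first would make possible, keeping the image values aligned with the names.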

liar666 commented 8 years ago

Oops... you're right, my CSS selectors are wrong, as the second one will match every time...

I've corrected them (along with a few other things, since the code on the website is quite strange: the enclosing div matches twice, one of the instances being empty) and now it works.

Thanks for pointing out my error.