Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). It also lets you manipulate the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Can't get DebugTagger to produce any output #30

Closed: pcolmer closed this issue 8 years ago

pcolmer commented 8 years ago

I've got the following as my crawler configuration:

<crawlers>
  <crawler id="Wiki Crawler">
    <startURLs stayOnDomain="true">
      <url>http://wiki.linaro.org/</url>
    </startURLs>
    #parse("shared/importer-config.xml")

    <importer>
      <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <rename fromField="document.contentEncoding" toField="content_encoding" overwrite="true" />
          <rename fromField="document.contentType" toField="content_type" overwrite="true" />
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logContent="true" >
        </tagger>
      </postParseHandlers>
    </importer>
  </crawler>
</crawlers>

When I run it, though, the DebugTagger produces no output.

log4j is set to log the main packages at DEBUG:

log4j.logger.com.norconex.collector.http=DEBUG
log4j.logger.com.norconex.collector.core=DEBUG
log4j.logger.com.norconex.importer=DEBUG
log4j.logger.com.norconex.committer=DEBUG

The documentation for DebugTagger says the list of fields is optional, so I'm expecting everything to get dumped in the output/log, but I'm not seeing anything.

essiembre commented 8 years ago

What is the content of "shared/importer-config.xml"? The file name suggests it contains an <importer> section. If so, you probably have two <importer> sections, and the second one (the one with DebugTagger in it) is ignored. Please confirm.

pcolmer commented 8 years ago

Thanks for spotting that! That will teach me to re-use one of the examples without fully understanding it :)

Just a suggestion: perhaps the code could emit a warning when it encounters multiple sections where only one is supported?

Removing the included content does indeed get my bits working.
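
For anyone hitting the same thing, here is a minimal sketch of the configuration that now works for me. It is the same config as above with the #parse line removed, so that only one <importer> section remains. The XML comments are mine, and this assumes you do not need anything else from the shared file; if you do, the handlers would presumably have to be merged into this single section by hand.

```xml
<crawlers>
  <crawler id="Wiki Crawler">
    <startURLs stayOnDomain="true">
      <url>http://wiki.linaro.org/</url>
    </startURLs>

    <!-- The #parse("shared/importer-config.xml") line that used to sit here
         is gone, so the section below is the only <importer> in this crawler. -->
    <importer>
      <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
          <rename fromField="document.contentEncoding" toField="content_encoding" overwrite="true" />
          <rename fromField="document.contentType" toField="content_type" overwrite="true" />
        </tagger>
        <!-- No field list is given, so all fields should be logged. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logContent="true" >
        </tagger>
      </postParseHandlers>
    </importer>
  </crawler>
</crawlers>
```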

essiembre commented 8 years ago

Glad it is working now. As for being notified when the config is invalid, there is currently a feature request for this, which you can track here: #27.