Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

TitleGeneratorTagger error when field text is empty or field doesn't exist #74

Closed jsteggink closed 6 years ago

jsteggink commented 6 years ago

The TitleGeneratorTagger throws an NPE, probably because the field it reads is empty or doesn't exist. Strings shouldn't be initialized as null but as empty strings, or else there should be null checks.

java.lang.NullPointerException
        at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
        at java.util.regex.Matcher.reset(Matcher.java:309)
        at java.util.regex.Matcher.<init>(Matcher.java:229)
        at java.util.regex.Pattern.matcher(Pattern.java:1093)
        at com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger.getHeadingTitle(TitleGeneratorTagger.java:286)
        at com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger.tagStringContent(TitleGeneratorTagger.java:190)
        at com.norconex.importer.handler.tagger.AbstractStringTagger.tagTextDocument(AbstractStringTagger.java:91)
        at com.norconex.importer.handler.tagger.AbstractCharStreamTagger.tagApplicableDocument(AbstractCharStreamTagger.java:102)
        at com.norconex.importer.handler.tagger.AbstractDocumentTagger.tagDocument(AbstractDocumentTagger.java:53)
        at com.norconex.importer.Importer.tagDocument(Importer.java:514)
        at com.norconex.importer.Importer.executeHandlers(Importer.java:345)
        at com.norconex.importer.Importer.importDocument(Importer.java:316)
        at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
        at com.norconex.importer.Importer.importDocument(Importer.java:190)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
        at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
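The top frames of the trace show `Pattern.matcher` being handed a null text, which fails inside the `Matcher` constructor. A minimal sketch of the failure and of a null guard follows. It is not the actual fix that went into the Importer: the class `NullSafeHeading`, the method `firstHeading`, and the `HEADING` pattern are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullSafeHeading {

    // Hypothetical pattern: captures the first non-blank line, trimmed.
    private static final Pattern HEADING =
            Pattern.compile("^\\s*(\\S.*?)\\s*$", Pattern.MULTILINE);

    // Returns the first non-blank line of the text, or null when the
    // text is null or blank (i.e. the field is missing or empty).
    static String firstHeading(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null; // guard: never pass null to Pattern.matcher
        }
        Matcher m = HEADING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Without a guard, a null text reproduces the reported failure.
        try {
            HEADING.matcher(null);
        } catch (NullPointerException e) {
            System.out.println("Unguarded call threw NullPointerException");
        }

        // With the guard, missing or empty fields simply yield no title.
        System.out.println(firstHeading(null));
        System.out.println(firstHeading(""));
        System.out.println(firstHeading("  Hello World  \nrest of document"));
    }
}
```

Returning null for a missing or blank field lets the caller skip title generation instead of aborting the whole document import; defaulting the field to an empty string, as suggested above, would be an alternative.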

essiembre commented 6 years ago

A new snapshot release of the Importer module has been made with a fix. Please confirm.

jsteggink commented 6 years ago

Thanks, it works!