Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Encountering java.io.UnsupportedEncodingException when Importer attempts to import a document #125

Closed by KarimTantawy 3 months ago

KarimTantawy commented 3 months ago

Hi Pascal,

I am currently migrating from version 2.x to 3.x and am encountering an IMPORTER_HANDLER_ERROR: the Importer throws an exception saying that the charset IBM424_ltr is not supported, and it cannot import the document. Previously, documents with this charset were imported and committed without any issues.

Any help would be greatly appreciated.

Here is the relevant part of the log:

15:11:05.564 [crawler-name] INFO IMPORTER_HANDLER_ERROR - site.com/document.pdf - PRE-parse - TextFilter[fieldMatcher=TextMatcher[ignoreCase=false,ignoreDiacritic=false,method=BASIC,partial=false,pattern=document.reference,replaceAll=false],valueMatcher=TextMatcher[ignoreCase=false,ignoreDiacritic=false,method=REGEX,partial=false,pattern=/([^/]*)/\\1/\\1/,replaceAll=false],maxReadSize=10000000,sourceCharset=<null>,onMatch=EXCLUDE,restrictions=[]]
15:11:05.565 [crawler-name] WARN Importer - Could not import document: CrawlDoc[orphan=false,docInfo=HttpDocInfo[depth=4,redirectTrail=<size=0>,referencedUrls=<size=0>,referrerLinkMetadata=attr=href tag=aDate=2024-07-11T15:11:05.560-04:00[America/New_York],state=NEW,contentEncoding=IBM424_ltr,contentType=application/pdf,embeddedParentReferences=<size=0>,reference=reference],metadata=<size=21>]
com.norconex.importer.handler.ImporterHandlerException: java.io.UnsupportedEncodingException: IBM424_ltr
    at com.norconex.importer.handler.filter.AbstractCharStreamFilter.isDocumentMatched(AbstractCharStreamFilter.java:100) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.filter.AbstractDocumentFilter.acceptDocument(AbstractDocumentFilter.java:106) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.acceptDocument(HandlerConsumer.java:143) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.accept(HandlerConsumer.java:115) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.accept(HandlerConsumer.java:63) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.commons.lang.function.Consumers.lambda$accept$0(Consumers.java:64) ~[norconex-commons-lang-2.0.2.jar:2.0.2]
    at java.util.ArrayList.forEach(Unknown Source) ~[?:1.8.0_401]
    at com.norconex.commons.lang.function.Consumers.accept(Consumers.java:64) ~[norconex-commons-lang-2.0.2.jar:2.0.2]
    at com.norconex.importer.Importer.executeHandlers(Importer.java:361) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.Importer.doImportDocument(Importer.java:318) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.Importer.importDocument(Importer.java:179) [norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) [norconex-commons-lang-2.0.2.jar:2.0.2]
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) [norconex-collector-http-3.0.2.jar:3.0.2]
    at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:611) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:922) [norconex-collector-core-2.0.2.jar:2.0.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_401]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_401]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_401]
Caused by: java.io.UnsupportedEncodingException: IBM424_ltr
    at sun.nio.cs.StreamDecoder.forInputStreamReader(Unknown Source) ~[?:1.8.0_401]
    at java.io.InputStreamReader.<init>(Unknown Source) ~[?:1.8.0_401]
    at com.norconex.importer.handler.filter.AbstractCharStreamFilter.isDocumentMatched(AbstractCharStreamFilter.java:97) ~[norconex-importer-3.0.1.jar:3.0.1]
    ... 20 more

Thank you, Karim Tantawy

essiembre commented 3 months ago

Which Importer "handler" is giving you the exception? From your stacktrace, it seems to be one that you use on PDFs "before" the PDF has been parsed (PRE-parse). If that's the case, you are trying to do a text operation on binary content. Moving the logic you are trying to create after parsing may solve this (post-parse handlers).

If that is not it, please share a config that reproduces the issue.
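
For readers who want to see the root cause in isolation: the "Caused by" frames in the log show java.io.InputStreamReader rejecting the charset name IBM424_ltr, which is the value reported as contentEncoding for the PDF. The sketch below reproduces that with the JDK only; it does not use any Norconex code. The class name CharsetCheck, both helper methods, and the UTF-8 fallback are hypothetical illustrations, not what the Importer does internally, and the result assumes a stock JDK where IBM424_ltr is not a registered charset name (as on the Java 8 runtime in the log).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {

    /** Returns true only if this JVM can decode text with the named charset. */
    static boolean isUsable(String name) {
        if (name == null) {
            return false; // Charset.isSupported(null) would throw
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // syntactically invalid name
        }
    }

    /** Hypothetical policy: fall back to UTF-8 when the name is unusable. */
    static Charset resolveOrDefault(String name) {
        return isUsable(name) ? Charset.forName(name) : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // A few bytes standing in for raw, unparsed PDF content.
        byte[] raw = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);

        try {
            // Same constructor as in the "Caused by" frames of the log.
            new InputStreamReader(new ByteArrayInputStream(raw), "IBM424_ltr");
            System.out.println("decoder created");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported encoding: " + e.getMessage());
        }

        System.out.println("fallback charset: " + resolveOrDefault("IBM424_ltr"));
    }
}
```

A guard like this only avoids the exception. Decoding raw PDF bytes as text would still produce meaningless characters, which is why running the text handler after parsing is the better fix.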

KarimTantawy commented 3 months ago

Hi @essiembre,

Thank you very much for the help.

Moving the logic after parsing did fix the issue. It was the "TextFilter" handler that was causing the error.

Best Regards, Karim Tantawy