Encountering java.io.UnsupportedEncodingException when Importer attempts to import a document

KarimTantawy commented 3 months ago

Hi Pascal,

I am currently migrating from version 2.x to 3.x and I am encountering an IMPORTER_HANDLER_ERROR: the Importer throws an exception saying the charset IBM424_ltr is not supported, and it cannot import the document. Previously, documents with this charset were imported and committed without any issues.

Any help would be greatly appreciated.

Here is the relevant part of the log:

15:11:05.564 [crawler-name] INFO IMPORTER_HANDLER_ERROR - site.com/document.pdf - PRE-parse - TextFilter[fieldMatcher=TextMatcher[ignoreCase=false,ignoreDiacritic=false,method=BASIC,partial=false,pattern=document.reference,replaceAll=false],valueMatcher=TextMatcher[ignoreCase=false,ignoreDiacritic=false,method=REGEX,partial=false,pattern=/([^/]*)/\\1/\\1/,replaceAll=false],maxReadSize=10000000,sourceCharset=<null>,onMatch=EXCLUDE,restrictions=[]]
15:11:05.565 [crawler-name] WARN Importer - Could not import document: CrawlDoc[orphan=false,docInfo=HttpDocInfo[depth=4,redirectTrail=<size=0>,referencedUrls=<size=0>,referrerLinkMetadata=attr=href tag=aDate=2024-07-11T15:11:05.560-04:00[America/New_York],state=NEW,contentEncoding=IBM424_ltr,contentType=application/pdf,embeddedParentReferences=<size=0>,reference=reference],metadata=<size=21>]
com.norconex.importer.handler.ImporterHandlerException: java.io.UnsupportedEncodingException: IBM424_ltr
    at com.norconex.importer.handler.filter.AbstractCharStreamFilter.isDocumentMatched(AbstractCharStreamFilter.java:100) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.filter.AbstractDocumentFilter.acceptDocument(AbstractDocumentFilter.java:106) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.acceptDocument(HandlerConsumer.java:143) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.accept(HandlerConsumer.java:115) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.handler.HandlerConsumer.accept(HandlerConsumer.java:63) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.commons.lang.function.Consumers.lambda$accept$0(Consumers.java:64) ~[norconex-commons-lang-2.0.2.jar:2.0.2]
    at java.util.ArrayList.forEach(Unknown Source) ~[?:1.8.0_401]
    at com.norconex.commons.lang.function.Consumers.accept(Consumers.java:64) ~[norconex-commons-lang-2.0.2.jar:2.0.2]
    at com.norconex.importer.Importer.executeHandlers(Importer.java:361) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.Importer.doImportDocument(Importer.java:318) ~[norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.importer.Importer.importDocument(Importer.java:179) [norconex-importer-3.0.1.jar:3.0.1]
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) [norconex-commons-lang-2.0.2.jar:2.0.2]
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) [norconex-collector-http-3.0.2.jar:3.0.2]
    at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:611) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) [norconex-collector-core-2.0.2.jar:2.0.2]
    at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:922) [norconex-collector-core-2.0.2.jar:2.0.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_401]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_401]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_401]
Caused by: java.io.UnsupportedEncodingException: IBM424_ltr
    at sun.nio.cs.StreamDecoder.forInputStreamReader(Unknown Source) ~[?:1.8.0_401]
    at java.io.InputStreamReader.<init>(Unknown Source) ~[?:1.8.0_401]
    at com.norconex.importer.handler.filter.AbstractCharStreamFilter.isDocumentMatched(AbstractCharStreamFilter.java:97) ~[norconex-importer-3.0.1.jar:3.0.1]
    ... 20 more

Thank you, Karim Tantawy

essiembre commented 3 months ago

Which Importer "handler" is giving you the exception? From your stacktrace, it seems to be one that you use on PDFs "before" the PDF has been parsed (PRE-parse). If that's the case, you are trying to do a text operation on binary content. Moving the logic you are trying to create after parsing may solve this (post-parse handlers).

If that is not it, please share a config that reproduces the issue.
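
For anyone landing here with the same stack trace, the failing call can be reproduced outside the Importer. The following is only a minimal sketch, not Importer code: it assumes (as the stack trace suggests) that the filter hands the document's reported contentEncoding, IBM424_ltr in the log above, to java.io.InputStreamReader as a charset name. The class and method names (CharsetProbe, canOpenReader) are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of Norconex Importer.
public class CharsetProbe {

    // Returns true if the JVM can create a character reader for the
    // given charset name, false if the name is not a supported charset.
    static boolean canOpenReader(String charsetName) {
        // Empty content is enough: the charset lookup happens in the
        // InputStreamReader constructor, before any byte is read.
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(new byte[0]), charsetName)) {
            return true;
        } catch (UnsupportedEncodingException e) {
            // Same exception type as the "Caused by" in the log above.
            return false;
        } catch (IOException e) {
            // Only close() can raise this here; not expected in memory.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8: " + canOpenReader("UTF-8"));
        System.out.println("IBM424_ltr: " + canOpenReader("IBM424_ltr"));
    }
}
```

The name IBM424_ltr is not a charset the JVM recognizes, so any handler that reads the raw PDF bytes as text under that name fails before matching starts. Running the same handler post-parse means it works on the extracted text instead of the binary content.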

KarimTantawy commented 3 months ago

Hi @essiembre,

Thank you very much for the help.

Moving the logic after parsing did fix the issue. It was the "TextFilter" handler that was causing the error.

Best Regards, Karim Tantawy

Norconex / importer

Encountering java.io.UnsupportedEncodingException when Importer attempts to import a document #125