CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0

Crash: EOL code word encountered in Black run. #79

Open liusida opened 5 years ago

liusida commented 5 years ago

While processing this PDF: http://papers.nips.cc/paper/2328-a-note-on-the-representational-incompatibility-of-function-approximation-and-factored-dynamics.pdf

An exception occurs and the program crashes.

$ java -cp cermine-impl-1.13-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path bug/
File processed: bug/2328-a-note-on-the-representational-incompatibility-of-function-approximation-and-factored-dynamics.pdf
Exception in thread "main" java.lang.RuntimeException: EOL code word encountered in Black run.
        at com.itextpdf.text.pdf.codec.TIFFFaxDecoder.decodeBlackCodeWord(TIFFFaxDecoder.java:1256)
        at com.itextpdf.text.pdf.codec.TIFFFaxDecoder.decodeT6(TIFFFaxDecoder.java:1013)
        at com.itextpdf.text.pdf.FilterHandlers$Filter_CCITTFAXDECODE.decode(FilterHandlers.java:191)
        at com.itextpdf.text.pdf.PdfReader.decodeBytes(PdfReader.java:2619)
        at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:189)
        at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:168)
        at com.itextpdf.text.pdf.parser.ImageRenderInfo.prepareImageObject(ImageRenderInfo.java:150)
        at com.itextpdf.text.pdf.parser.ImageRenderInfo.getImage(ImageRenderInfo.java:140)
        at pl.edu.icm.cermine.structure.ITextCharacterExtractor$BxDocumentCreator.renderImage(ITextCharacterExtractor.java:366)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ImageXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1311)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
        at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
        at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
        at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
        at pl.edu.icm.cermine.InternalContentExtractor.getImages(InternalContentExtractor.java:169)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:290)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:307)
        at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:805)
liusida commented 5 years ago

Another crash, this time while loading a font's ToUnicode CMap rather than while decoding an image. Related PDF file: http://papers.nips.cc/paper/6168-estimating-the-class-prior-and-posterior-from-noisy-positives-and-unlabeled-data.pdf

$ java -cp cermine-impl-1.13-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path bug/2/
File processed: bug/2/6168-estimating-the-class-prior-and-posterior-from-noisy-positives-and-unlabeled-data.pdf
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
        at com.itextpdf.text.pdf.fonts.cmaps.CMapToUnicode.convertToInt(CMapToUnicode.java:137)
        at com.itextpdf.text.pdf.fonts.cmaps.CMapToUnicode.createReverseMapping(CMapToUnicode.java:111)
        at com.itextpdf.text.pdf.CMapAwareDocumentFont.processToUnicode(CMapAwareDocumentFont.java:140)
        at com.itextpdf.text.pdf.CMapAwareDocumentFont.initFont(CMapAwareDocumentFont.java:110)
        at com.itextpdf.text.pdf.CMapAwareDocumentFont.<init>(CMapAwareDocumentFont.java:106)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.getFont(PdfContentStreamProcessor.java:162)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$5300(PdfContentStreamProcessor.java:83)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$SetTextFont.invoke(PdfContentStreamProcessor.java:682)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
        at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
        at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
        at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
        at pl.edu.icm.cermine.InternalContentExtractor.getImages(InternalContentExtractor.java:169)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:290)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:307)
        at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:805)

Hope it helps.
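
Both traces end in unchecked exceptions (a RuntimeException and an ArrayIndexOutOfBoundsException) that propagate up to main, so one bad PDF stops the whole -path run. Until the underlying decoding problems are fixed, a caller could isolate failures per file. The following is only a minimal sketch of that idea, not CERMINE's actual code: the class BatchGuard, the method processAll and the processor callback are hypothetical names, and the callback stands in for whatever extraction call is made for each file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of CERMINE or iText): isolates per-file failures in a batch. */
public class BatchGuard {

    /**
     * Runs the processor on every path and returns the paths whose processing threw.
     * The processor is an assumed stand-in for the real per-file extraction call.
     */
    public static List<String> processAll(List<String> paths, Consumer<String> processor) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                processor.accept(path);
            } catch (RuntimeException e) {
                // ArrayIndexOutOfBoundsException is a subclass of RuntimeException,
                // so both crashes reported above would be caught here.
                System.err.println("Skipping " + path + ": " + e);
                failed.add(path);
            }
        }
        return failed;
    }
}
```

With something like this, the two PDFs above would be reported as failed and the remaining files in the directory would still be processed. It only hides the symptom: the images or fonts in those two files still cannot be decoded.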