CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0

Exception in thread "main" java.lang.NullPointerException #94

Open mluerig opened 4 years ago

mluerig commented 4 years ago

I ran the following command on a nested folder of PDFs and got a NullPointerException:

java -cp cermine-impl-1.13-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path data_raw/pdfs

PDF with the issue:

https://www.cloud.luerig.net/index.php/s/CKQRnDePF9aRFwo

My Java version (Windows 10 machine):

java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) Client VM (build 25.251-b08, mixed mode)

Full error message:

Exception in thread "main" java.lang.NullPointerException
        at com.itextpdf.text.pdf.parser.PdfImageObject.decodeImageBytes(PdfImageObject.java:298)
        at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:199)
        at com.itextpdf.text.pdf.parser.PdfImageObject.<init>(PdfImageObject.java:168)
        at com.itextpdf.text.pdf.parser.ImageRenderInfo.prepareImageObject(ImageRenderInfo.java:150)
        at com.itextpdf.text.pdf.parser.ImageRenderInfo.getImage(ImageRenderInfo.java:140)
        at pl.edu.icm.cermine.structure.ITextCharacterExtractor$BxDocumentCreator.renderImage(ITextCharacterExtractor.java:366)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$ImageXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1311)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448)
        at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112)
        at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60)
        at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:346)
        at pl.edu.icm.cermine.InternalContentExtractor.getImages(InternalContentExtractor.java:169)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:290)
        at pl.edu.icm.cermine.ContentExtractor.getImages(ContentExtractor.java:307)
        at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:805)

mluerig commented 4 years ago

I just remembered filing a similar issue (https://github.com/CeON/CERMINE/issues/36) a few years ago. Back then I asked whether CERMINE has any built-in exception handling. Is that the case now? Otherwise I would try to run it from Python so that I can skip the files that fail.

This is a great tool, by the way. We are just about to submit our first publication based entirely on results obtained from CERMINE.
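
For reference, the Python workaround mentioned above might look roughly like the sketch below. It is not part of CERMINE. It runs the command from the report once per PDF, each in its own scratch directory, so that a crash on one file does not stop the rest of the batch. The helper names build_command and extract_all are made up for illustration. Two assumptions are baked in: that the java process exits with a non-zero code when the exception reaches main, and that CERMINE writes its output next to the input PDF.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Jar name and main class are copied from the command in the report.
JAR = "cermine-impl-1.13-jar-with-dependencies.jar"
MAIN_CLASS = "pl.edu.icm.cermine.ContentExtractor"


def build_command(pdf_dir, jar=JAR):
    """Return the CERMINE command line for one directory of PDFs."""
    return ["java", "-cp", str(jar), MAIN_CLASS, "-path", str(pdf_dir)]


def extract_all(root, runner=subprocess.run, jar=JAR):
    """Run CERMINE once per PDF found under root (hypothetical helper).

    Returns (succeeded, failed) as two lists of Path objects.
    """
    succeeded, failed = [], []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        # -path takes a directory, so isolate each PDF in a scratch directory.
        with tempfile.TemporaryDirectory() as scratch:
            shutil.copy2(pdf, scratch)
            result = runner(build_command(scratch, jar))
            # Assumption: an uncaught exception gives a non-zero exit code.
            if result.returncode != 0:
                failed.append(pdf)
                continue
            # Assumption: output files and folders appear next to the input PDF.
            for produced in Path(scratch).iterdir():
                if produced.name == pdf.name:
                    continue
                target = pdf.parent / produced.name
                if produced.is_dir():
                    shutil.copytree(produced, target, dirs_exist_ok=True)
                else:
                    shutil.copy2(produced, target)
            succeeded.append(pdf)
    return succeeded, failed
```

Calling extract_all("data_raw/pdfs") would then return the list of PDFs that were processed and the list that failed, which can be logged or retried later. Starting one JVM per file is slower than a single batch run, which is the price of isolating the failures.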