CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0
483 stars 99 forks source link

iText image conversion issue not handled #64

Open eitanf opened 6 years ago

eitanf commented 6 years ago

For some documents (see for example: ASPLOS_17_030.pdf, the Itext.text.pdf converter fails with an exception: Caused by: com.itextpdf.text.pdf.parser.InlineImageUtils$InlineImageParseException: EI not found after end of image data This is ostensibly a known issue with images in a format iText can't convert. CERMINE currently doesn't catch this exception and it crashes. It should catch it and either skip the problematic page or the problematic document.

eitanf commented 6 years ago

I found a separate conversion error that achieves a similar crash. If you run on this file, you get the following exception: Exception in thread "main" java.lang.IllegalArgumentException: Unexpected color space /CS1 at com.itextpdf.text.pdf.parser.InlineImageUtils.getComponentsPerPixel(InlineImageUtils.java:250) at com.itextpdf.text.pdf.parser.InlineImageUtils.computeBytesPerRow(InlineImageUtils.java:263) at com.itextpdf.text.pdf.parser.InlineImageUtils.parseUnfilteredSamples(InlineImageUtils.java:292) at com.itextpdf.text.pdf.parser.InlineImageUtils.parseInlineImageSamples(InlineImageUtils.java:336) at com.itextpdf.text.pdf.parser.InlineImageUtils.parseInlineImage(InlineImageUtils.java:159) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:446) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$FormXObjectDoHandler.handleXObject(PdfContentStreamProcessor.java:1289) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.displayXObject(PdfContentStreamProcessor.java:375) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:83) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:1023) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:310) at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:448) at pl.edu.icm.cermine.structure.ITextCharacterExtractor.extractCharacters(ITextCharacterExtractor.java:112) at pl.edu.icm.cermine.ExtractionUtils.extractCharacters(ExtractionUtils.java:60) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:348) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.doWork(InternalContentExtractor.java:341) at pl.edu.icm.cermine.InternalContentExtractor.getContentAsNLM(InternalContentExtractor.java:301) at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:662) at pl.edu.icm.cermine.ContentExtractor.getContentAsNLM(ContentExtractor.java:692) at pl.edu.icm.cermine.ContentExtractor.main(ContentExtractor.java:818)

The "fix" I'm suggesting is similar to the previous one: catch the exception and skip the page. However, it's possible that there's a fix directly to itext.

dtkaczyk commented 6 years ago

Hi @eitanf, thanks for reporting.

I agree such an exception shouldn't make the whole process crash. Skipping the problematic document sounds like a good idea.

I am not sure whether skipping pages by default is a good idea. It results not only in the problematic images missing, but also text fragments missing from the result. In some cases it might be ok, in others not. Instead of doing that by default, I would prefer having a command line option in pl.edu.icm.cermine.ContentExtractor. for this. What do you think?

BTW, it seems to me the second catch in your code is unnecessary, with only ExceptionConverter caught, the second document doesn't crash anymore. Can you confirm that?

eitanf commented 6 years ago

I think that adding a command-line parameter to control skipping page or document on a case-by-case basis is an excellent idea. It should probably be extended to other exceptions caught elsewhere as well.

The second catch in my code is necessary. The second exception wasn't caught by the first catch, which is why I had to add it. Isn't the java.lang exception (IllegalArgumentException) in a different class hierarchy than ExceptionConverter?

dtkaczyk commented 6 years ago

@eitanf If you would still like to contribute this enhancement, please modify your pull request to include a command line option for skipping pages. Since I think skipping pages by default would be confusing, I can't merge this as is.

eitanf commented 6 years ago

I gave it a shot, but don't have an elegant way to do this. I think a global-scale change like this is best done by someone more familiar with the architecture of the software.

dtkaczyk commented 6 years ago

Fair enough. For now, I could merge your pull request, if you modified it so that instead of skipping pages, AnalysisException is thrown.