felipeochoa / minecart

Simple, Pythonic extraction of text, shapes and images from PDFs
MIT License
79 stars 17 forks

Error extracting images #16

Open mushroom-matthew opened 6 years ago

mushroom-matthew commented 6 years ago

Hello, I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In many instances the default settings can do this, but I have found a few instances where the extracted images are incorrectly colored and/or boldface figure titles are not being registered as letterings.

One way I could envision working around this is to extract the Images with the found bounding box plus some extra pixel range on the left, right, and bottom. Is there a way to expand the Image class bounding boxes and extract the info in the new bounding boxes?

Alternatively, if you could help me understand how to change the settings used in both image and letterings extraction, that would be very helpful.

Best, mushroom-matthew

mushroom-matthew commented 6 years ago

I've started exploring the various output classes of the minecart workflow. I can now see that Documents, Pages, Images, Letterings, etc. each have functions for adjusting bboxes and finding colors. I will explore these more.

That being said, I would still like to know how to adjust sensitivity to words and how to work with various colorspaces as I have encountered the following error.

PDFNotImplementedError: Colorspace '<PDFObjRef:8>' is not supported

felipeochoa commented 6 years ago

You can use the iter_in_bbox method to inspect all the elements inside a bounding box. So you could, e.g., look at an image, expand the bounding box and then iterate on all letterings inside the bounding box.
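
As a rough sketch of that idea, the helper below widens a box and then asks the page for the letterings inside it. It assumes a bounding box is an `(x0, y0, x1, y1)` tuple in PDF coordinates (origin at the bottom left, so a caption below a figure has a smaller `y`), and that an image exposes its box through `get_bbox()`; both are assumptions to check against your minecart version. `expand_bbox` and `letterings_near` are hypothetical names, not part of the library.

```python
def expand_bbox(bbox, left=0.0, right=0.0, bottom=0.0, top=0.0):
    """Return a copy of bbox grown by the given margins.

    Assumes bbox is (x0, y0, x1, y1) with the origin at the bottom left,
    so growing "down" means lowering y0.
    """
    x0, y0, x1, y1 = bbox
    return (x0 - left, y0 - bottom, x1 + right, y1 + top)


def letterings_near(page, image, margin=20.0):
    """Collect the letterings around an image (hypothetical helper).

    Assumes image.get_bbox() returns the image's box; iter_in_bbox is the
    method mentioned in the comment above.
    """
    search_box = expand_bbox(
        image.get_bbox(), left=margin, right=margin, bottom=margin
    )
    return list(page.letterings.iter_in_bbox(search_box))
```

The margin is in PDF units (points), not pixels, so a value that works for one journal layout may need tuning for another.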

There is no sensitivity to adjust here -- minecart extracts text as it was created in the PDF, and unfortunately some PDFs break up text into characters or sub-word strings that you have to manually stitch together.
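
One simple way to do that stitching, sketched here under stated assumptions: group fragments whose baselines are within a small tolerance, sort each group left to right, and insert a space wherever the horizontal gap is larger than a threshold. The function takes plain `(text, bbox)` pairs rather than minecart objects, so how you obtain the text and box of each lettering is left to you. `stitch_fragments`, `line_tol` and `gap_tol` are illustrative names and values, not library features.

```python
def stitch_fragments(fragments, line_tol=2.0, gap_tol=1.0):
    """Join text fragments into lines using their bounding boxes.

    fragments: iterable of (text, (x0, y0, x1, y1)) pairs, with the
    origin at the bottom left (an assumption about the coordinates).
    """
    # Walk from the top of the page down, grouping fragments whose
    # bottom edges are within line_tol of the current line.
    ordered = sorted(fragments, key=lambda frag: -frag[1][1])
    lines = []
    for frag in ordered:
        y0 = frag[1][1]
        if lines and abs(lines[-1][0] - y0) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((y0, [frag]))

    stitched = []
    for _, line_frags in lines:
        # Within a line, read left to right.
        line_frags.sort(key=lambda frag: frag[1][0])
        text = ""
        prev_x1 = None
        for piece, (x0, _, x1, _) in line_frags:
            # A visible horizontal gap is treated as a word break.
            if prev_x1 is not None and x0 - prev_x1 > gap_tol:
                text += " "
            text += piece
            prev_x1 = x1
        stitched.append(text)
    return stitched
```

This heuristic ignores rotated text, multi-column layouts and font size, so treat it as a starting point only.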

In terms of the image colors and the error above, it looks like you are using the as_pil method. Unfortunately, that's a relatively unfinished part of the library. You could try using img.obj.get_data() and seeing if that will work for you. Otherwise, you can try commenting out lines 274-368 in content.py to see if that fixes the problem for you.

mushroom-matthew commented 6 years ago

Thanks for the advice! It helped me start to navigate the classes within this software.

Both of my issues are solved. I thought I'd share my fix for showing images regardless of the PDF object's colorspace (specifically when it is '<PDFObjRef:8>').

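```python
import io

from PIL import Image

# `image` is a minecart image object, e.g. one item of page.images.
# Read the raw stream bytes and let Pillow detect the format itself.
byte_array = image.obj.get_data()
pil_image = Image.open(io.BytesIO(byte_array))
pil_image.show()
```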
felipeochoa commented 6 years ago

Thanks for sharing! I'll leave this up in case anyone wants to properly improve this part of the library

mushroom-matthew commented 6 years ago

Felipe -- I am glad that my tip may be helpful in the long run. If this library issue remains, I may take a crack at improving its implementation.

Along those lines, I was wondering if you have explored any other PDF readers or OCR which may be able to better handle some of the other filters, including /CCITTFaxDecode. From a dev standpoint, is the use of other packages frowned upon or do outside developers such as myself have freedom to test various package deployments?

felipeochoa commented 6 years ago

Definitely not tied to pdfminer! The premise of this library is exposing a nice interface to work with pdfs, so if we can preserve the outer API while changing the internals, I don't really care one way or another. (I imagine that may be challenging though!)

I have not played with any other readers since I wrote this, and haven't had a need to use OCR. OCR would probably be a nice complement though!

mushroom-matthew commented 6 years ago

In my efforts to "simply" extract the images and captions from my library of ~450 PDFs, I started running into a bunch of problems. While I had some success with the files that were born digital, those that were scanned, for example, were not handled well. In fact, scanned files were just one class of PDFs that failed: 333 of the 450 documents were dropped for one reason or another. I collected and counted all of the reasons as follows:

AttributeError: 'PDFGraphicState' object has no attribute 'fill_color': 15
AttributeError: 'PDFGraphicState' object has no attribute 'stroke_color': 31
KeyError: 'Cs6': 1
KeyError: 'DeviceN': 1
OSError: cannot identify image file <_io.BytesIO object at <SOMEKEY>>: 102
TypeError: a bytes-like object is required, not 'str': 4
pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 11): 1
pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 144): 1
pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 1849): 1
pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 59952): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 1): 3
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 129): 7
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 13): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 132): 3
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 160): 6
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 173): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 176): 4
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2): 11
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 211): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 213): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 223): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 24): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 25): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 26): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 3): 3
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 30): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 31): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 4): 2
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 63): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 8): 1
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 88): 1
pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode: 113
pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JBIG2Decode: 5
pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JPXDecode: 1
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2: 3
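
A tally like the one above can be collected with a small wrapper around whatever per-file extraction routine you use. This is only a sketch: `tally_failures` is a hypothetical helper, and `process` stands for your own function that takes a path and raises on failure.

```python
from collections import Counter


def tally_failures(paths, process):
    """Run process(path) on every path and count failures by error text."""
    failures = Counter()
    succeeded = []
    for path in paths:
        try:
            process(path)
        except Exception as exc:
            cls = type(exc)
            # Built-in errors are shown bare; others get their module prefix.
            if cls.__module__ == "builtins":
                name = cls.__name__
            else:
                name = "{}.{}".format(cls.__module__, cls.__name__)
            failures["{}: {}".format(name, exc)] += 1
        else:
            succeeded.append(path)
    return succeeded, failures
```

Calling `failures.most_common()` afterwards lists the most frequent reasons first, which helps decide what to fix first.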

Several of these errors arose from being unable to recognize the various fonts that were embedded in the files. Others arose from being unable to handle scanned images. It's probably a pretty big task to update the library for all of these cases. It should be helpful, however, to know the potential errors one may face when attempting this task.
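
One possible explanation for the "cannot identify image file" errors is that the stream bytes are raw pixel samples rather than a complete image file, which an image library cannot open without extra information such as width, height and colorspace. A quick way to check is to look at the first few bytes of the data. The sketch below tests for the signatures of a few common file formats; `sniff_image_format` is a hypothetical helper and the list of signatures is not exhaustive.

```python
# Leading "magic" bytes of a few self-contained image file formats.
MAGIC_NUMBERS = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),
    (b"II*\x00", "tiff"),
    (b"MM\x00*", "tiff"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_format(data):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return None
```

If this returns `None` for the bytes of an extracted image, the data is probably not a standalone file and would need to be rebuilt from the image's own parameters before it can be displayed.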

Abhiroyq1 commented 6 years ago

For the error "'PDFGraphicState' object has no attribute 'fill_color'":

I changed some code in the pdfinterp.py file of the pdfminer module. In class PDFGraphicState, I added 2 lines to the initialisation as well as to the "copy" method.

Akash91 commented 5 years ago

I was able to extract the color using this code snippet:

```python
import minecart

colors = set()

with open("{pathtoyourPDFhere}.pdf", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(0)
    for shape in page.shapes:
        if shape.fill:
            colors.add(shape.fill.color.as_rgb())

for color in colors:
    print(color)
```

But this gives us RGB colors, which are not the same as the colors that are printed, i.e. DeviceCMYK. I have tried Ghostscript, ImageMagick and other libraries, but all of them provide a class that does not have a to_cmyk() method.
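
For reference, the naive device conversion from RGB to CMYK can be written in a few lines. This sketch assumes the RGB components are floats between 0 and 1, which is an assumption about what `as_rgb()` returns. It does not recover the CMYK values originally stored in the PDF and it ignores ICC color profiles, so the result will generally differ from the printed colors. `rgb_to_cmyk` is a hypothetical helper, not a minecart function.

```python
def rgb_to_cmyk(rgb):
    """Naive RGB -> CMYK conversion for components in the range 0..1."""
    r, g, b = rgb
    # Black is whatever is missing from the brightest channel.
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        # Pure black: avoid dividing by zero below.
        return (0.0, 0.0, 0.0, 1.0)
    scale = 1.0 - k
    c = (1.0 - r - k) / scale
    m = (1.0 - g - k) / scale
    y = (1.0 - b - k) / scale
    return (c, m, y, k)
```

Getting the exact DeviceCMYK values would require reading the color operands before they are converted to RGB, which this sketch does not attempt.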

I am looking for contributors who can help me address this issue. Let me know if anyone is interested.