Open mushroom-matthew opened 6 years ago
I've started exploring the various output classes of the minecart workflow. I can now see that each of them (Documents, Pages, Images, Letterings, etc.) has functions for adjusting bboxes and finding colors. I will explore these more.
That being said, I would still like to know how to adjust sensitivity to words and how to work with various colorspaces, as I have encountered the following error:
PDFNotImplementedError: Colorspace 'PDFObjRef:8>' is not supported
You can use the iter_in_bbox method to inspect all the elements inside a bounding box. So you could, e.g., look at an image, expand the bounding box, and then iterate on all letterings inside the bounding box.
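For anyone trying the bounding-box approach, here is a minimal sketch of the "expand the box" step. The helper name expand_bbox and the margin values are made up for illustration, and it assumes bboxes are (x0, y0, x1, y1) tuples in PDF coordinates, where y grows upward, so the "bottom" margin lowers y0. The minecart calls in the trailing comment are an untested guess at how it would plug in.

```python
def expand_bbox(bbox, left=0.0, right=0.0, bottom=0.0, top=0.0):
    """Return a copy of bbox grown by the given margins (in PDF points)."""
    x0, y0, x1, y1 = bbox
    return (x0 - left, y0 - bottom, x1 + right, y1 + top)


# Made-up image bounding box, widened on both sides and extended downward
# to catch a caption that sits below the figure.
image_bbox = (100.0, 400.0, 300.0, 550.0)
search_bbox = expand_bbox(image_bbox, left=20, right=20, bottom=60)
print(search_bbox)  # (80.0, 340.0, 320.0, 550.0)

# Assumed usage with minecart (not verified here):
#   for image in page.images:
#       region = expand_bbox(image.get_bbox(), left=20, right=20, bottom=60)
#       for lettering in page.letterings.iter_in_bbox(region):
#           print(lettering)
```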
There is no sensitivity to adjust here -- minecart extracts text as it was created in the PDF, and unfortunately some PDFs break up text into characters or sub-word strings that you have to manually stitch together.
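To make the "manually stitch together" part concrete, here is one possible sketch. It works on plain (text, bbox) tuples rather than minecart objects, so pulling the text and bbox out of each lettering is left to the reader. The function name stitch_fragments and the max_gap and line_tol thresholds are hypothetical and would need tuning per document.

```python
def stitch_fragments(fragments, max_gap=1.5, line_tol=2.0):
    """Join (text, (x0, y0, x1, y1)) fragments into words.

    Fragments whose y0 values differ by at most line_tol are treated as one
    line; within a line, a horizontal gap above max_gap starts a new word.
    """
    # Group fragments into lines, scanning from the top of the page down.
    lines = []
    for frag in sorted(fragments, key=lambda f: -f[1][1]):
        if lines and abs(lines[-1][0][1][1] - frag[1][1]) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    words = []
    for line in lines:
        line.sort(key=lambda f: f[1][0])  # left to right
        current, prev_x1 = "", None
        for text, (x0, _y0, x1, _y1) in line:
            if prev_x1 is not None and x0 - prev_x1 > max_gap:
                words.append(current)
                current = ""
            current += text
            prev_x1 = x1
        words.append(current)
    return words


# Made-up fragments: "Figure" and "Caption" are each split in two.
fragments = [
    ("Fig", (10, 100, 25, 110)),
    ("ure", (25.5, 100, 40, 110)),
    ("1", (45, 100, 50, 110)),
    ("Cap", (10, 85, 26, 95)),
    ("tion", (26, 85.5, 44, 95.5)),
]
print(stitch_fragments(fragments))  # ['Figure', '1', 'Caption']
```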
In terms of the image colors and the error above, it looks like you are using the as_pil method. Unfortunately, that's a relatively unfinished part of the library. You could try using img.obj.get_data() and seeing if that will work for you. Otherwise, you can try commenting out lines 274-368 in content.py to see if that fixes the problem for you.
Thanks for the advice! It helped me start to navigate the classes within this software.
Both of my issues are solved. I thought I'd share my fix for showing images regardless of the PDFObject's Colorspace (specifically if it has 'PDFObjRef:8>').
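import io
from PIL import Image  # "import PIL" alone does not reliably load the Image submodule
# `image` is a minecart image object, e.g. one of page.images
byte_array = image.obj.get_data()
pil_image = Image.open(io.BytesIO(byte_array))
pil_image.show()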
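One caveat worth sketching: opening the bytes with PIL can only work when they already form a complete image file (for example a JPEG), not raw pixel data or a fax-encoded stream. A quick check of the leading bytes can show which case you are in before handing them to PIL. The function name sniff_image_format is made up for illustration; the signatures are the standard magic numbers for each format.

```python
def sniff_image_format(data):
    """Guess a file format from the leading bytes, or return None."""
    signatures = [
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"II*\x00", "tiff"),  # little-endian TIFF
        (b"MM\x00*", "tiff"),  # big-endian TIFF
        (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),  # JP2 container
        (b"\xff\x4f\xff\x51", "jpeg2000-codestream"),
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return None


print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 4))  # jpeg
print(sniff_image_format(b"\x00" * 16))                        # None
```

If this returns None for the bytes from get_data(), the stream is probably not a standalone image file, and PIL is likely to fail with "cannot identify image file".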
Thanks for sharing! I'll leave this up in case anyone wants to properly improve this part of the library.
Felipe -- I am glad that my tip may be helpful in the long run. If this library issue remains, I may take a crack at improving its implementation.
Along those lines, I was wondering if you have explored any other PDF readers or OCR tools that may be better able to handle some of the other filters, including /CCITTFaxDecode. From a dev standpoint, is the use of other packages frowned upon, or do outside developers such as myself have freedom to test various package deployments?
Definitely not tied to pdfminer! The premise of this library is exposing a nice interface to work with pdfs, so if we can preserve the outer API while changing the internals, I don't really care one way or another. (I imagine that may be challenging though!)
I have not played with any other readers since I wrote this, and haven't had a need to use OCR. OCR would probably be a nice complement though!
In my efforts to "simply" extract the images and captions from my library of ~450 PDFs, I started running into a bunch of problems. While I have had some success with the files that were born digital, those that were scanned, for example, were not handled well. In fact, scanned files were just one class of PDFs that weren't handled: 333 of the 450 documents were dropped for one reason or another. I collected and counted all of the reasons as follows:
"AttributeError: 'PDFGraphicState' object has no attribute 'fill_color'": 15,
"AttributeError: 'PDFGraphicState' object has no attribute 'stroke_color'": 31,
"KeyError: 'Cs6'": 1,
"KeyError: 'DeviceN'": 1,
'OSError: cannot identify image file <_io.BytesIO object at <SOMEKEY>>': 102,
"TypeError: a bytes-like object is required, not 'str'": 4,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 11)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 144)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 1849)": 1,
"pdfminer.pdffont.PDFUnicodeNotDefined: ('Adobe-Identity', 59952)": 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 1)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 129)': 7,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 13)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 132)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 160)': 6,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 173)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 176)': 4,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)': 11,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 211)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 213)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 223)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 24)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 25)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 26)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 3)': 3,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 30)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 31)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 4)': 2,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 63)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 8)': 1,
'pdfminer.pdffont.PDFUnicodeNotDefined: (None, 88)': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /CCITTFaxDecode': 113,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JBIG2Decode': 5,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported filter: /JPXDecode': 1,
'pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2': 3
Several of these errors arose from being unable to recognize the various fonts that were embedded in the files. Others weren't able to handle scanned images. It's probably a pretty big task to update the library for all of these cases. It should be helpful, however, to know the potential errors one may face when attempting this task.
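For anyone who wants to build a similar tally for their own PDF collection, here is a minimal sketch of one way to do it. The names tally_failures and fake_process are hypothetical; in practice the process callable would be whatever extraction routine you run on each file.

```python
from collections import Counter


def tally_failures(paths, process):
    """Run process(path) on each path and count failures by error message."""
    failures = Counter()
    succeeded = []
    for path in paths:
        try:
            process(path)
        except Exception as exc:  # deliberately broad: record every failure mode
            failures[f"{type(exc).__name__}: {exc}"] += 1
        else:
            succeeded.append(path)
    return succeeded, failures


# Stand-in for a real extraction function, for demonstration only.
def fake_process(path):
    if path.endswith("scan.pdf"):
        raise NotImplementedError("Unsupported filter: /CCITTFaxDecode")


ok, failures = tally_failures(["a.pdf", "b_scan.pdf", "c_scan.pdf"], fake_process)
print(ok)                       # ['a.pdf']
print(failures.most_common())   # [('NotImplementedError: Unsupported filter: /CCITTFaxDecode', 2)]
```

Messages that embed memory addresses or other per-run values would need normalizing before counting, otherwise each occurrence ends up as its own key.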
I was able to extract the color using this code snippet:
import minecart
colors = set()
with open("{pathtoyourPDFhere}.pdf", "rb") as file:
    document = minecart.Document(file)
    page = document.get_page(0)
    for shape in page.shapes:
        if shape.fill:
            colors.add(shape.fill.color.as_rgb())
for color in colors:
    print(color)
However, this gives RGB colors, which are not the same as the colors that are actually printed, i.e. the DeviceCMYK values. I have tried Ghostscript, ImageMagick, and other libraries, but all of them return a class that does not have a to_cmyk() method.
I am looking for contributors who can help me address this issue. Let me know if anyone is interested.
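As a stopgap, the textbook RGB to CMYK formula can be applied to the RGB tuples directly. This is only a rough sketch: it assumes the components are floats in the range 0.0 to 1.0, the function name rgb_to_cmyk is made up, and the result is a naive approximation that will generally not match the original DeviceCMYK values stored in the PDF, since no color profile is involved.

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion for components in the range 0.0-1.0."""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:  # pure black: avoid dividing by zero
        return (0.0, 0.0, 0.0, 1.0)
    scale = 1.0 - k
    return (
        (1.0 - r - k) / scale,
        (1.0 - g - k) / scale,
        (1.0 - b - k) / scale,
        k,
    )


print(rgb_to_cmyk(1.0, 0.0, 0.0))  # (0.0, 1.0, 1.0, 0.0)
print(rgb_to_cmyk(0.0, 0.0, 0.0))  # (0.0, 0.0, 0.0, 1.0)
```

Recovering the values exactly as they appear in the file would require reading the color operands before any conversion to RGB happens, which this sketch does not attempt.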
Hello, I am working with a database of PDFs containing research articles in a niche set of academic areas. I am hoping to extract all of the figures and captions. In many instances the default settings can do this, but I have found a few instances where the extracted images are incorrectly colored and/or bold-face figure titles are not being registered as letterings.
One way I could envision working around this is to extract the Images with the found bounding box plus some extra pixel range on the left, right, and bottom. Is there a way to expand the Image class bounding boxes and extract the info in the new bounding boxes?
Alternatively, if you could help me understand how to change the settings used in both image and letterings extraction, that would be very helpful.
Best, mushroom-matthew