AiDAPT-A / VisArchPy

pipelines for the extraction and processing of visuals from PDFs
https://visarchpy.readthedocs.io
MIT License

Problems when extracting 4-bit CMYK images #27

Open manuGil opened 1 year ago

manuGil commented 1 year ago

This is a known issue in pdfminer.six: https://github.com/pdfminer/pdfminer.six/issues/853

For example, when attempting to save the bytes of an image element like:

<PDFStream(20): raw=5846880, {'BitsPerComponent': 8, 'ColorSpace': /'DeviceCMYK', 'Decode': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
 'DecodeParms': [{'BitsPerComponent': 4, 'Colors': 4, 'Columns': 2953, 'Predictor': 15}], 'Filter': [/'FlateDecode'], 'Height': 1205, 
'Length': 5846880, 'SMask': <PDFObjRef:19>, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 2953}> [/'DeviceCMYK']

We get the following error:

  File "/home/manuel/Documents/devel/desing-handbook/data-pipelines/src/aidapta/image_pipeline.py", line 139, in main
    image_file_name =iw.export_image(img) # returns image file name, 
  File "/home/manuel/Documents/devel/pdfminer.six/pdfminer/image.py", line 129, in export_image
    name = self._save_bytes(image)
  File "/home/manuel/Documents/devel/pdfminer.six/pdfminer/image.py", line 227, in _save_bytes
    image.stream.get_data()
  File "/home/manuel/Documents/devel/pdfminer.six/pdfminer/pdftypes.py", line 396, in get_data
    self.decode()
  File "/home/manuel/Documents/devel/pdfminer.six/pdfminer/pdftypes.py", line 384, in decode
    data = apply_png_predictor(
  File "/home/manuel/Documents/devel/pdfminer.six/pdfminer/utils.py", line 137, in apply_png_predictor
    raise ValueError(msg)
ValueError: Unsupported `bitspercomponent': 4

The solution proposed in https://github.com/pdfminer/pdfminer.six/pull/854/files doesn't solve this problem.
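
For reference, PNG row filters operate on whole bytes rather than on individual samples, so reversing them does not by itself require 8 bits per component. Below is a minimal sketch of that idea, not pdfminer's implementation. The names `png_unfilter` and `paeth` are hypothetical, and the code assumes every row starts with a filter-type byte and is padded to a byte boundary.

```python
def paeth(left: int, up: int, up_left: int) -> int:
    """Paeth predictor as defined in the PNG specification."""
    p = left + up - up_left
    pa, pb, pc = abs(p - left), abs(p - up), abs(p - up_left)
    if pa <= pb and pa <= pc:
        return left
    if pb <= pc:
        return up
    return up_left


def png_unfilter(data: bytes, columns: int, colors: int, bits: int) -> bytes:
    """Reverse PNG row filtering at byte level (hypothetical helper).

    Works for any bit depth because the filters are defined on bytes.
    """
    # Bytes per complete pixel, rounded up (at least 1).
    bpp = max(1, (colors * bits + 7) // 8)
    # Bytes per row of samples, rounded up to a byte boundary.
    row_len = (columns * colors * bits + 7) // 8
    stride = row_len + 1  # one leading filter-type byte per row
    if len(data) % stride != 0:
        raise ValueError("data length is not a multiple of the row stride")

    out = bytearray()
    prev = bytearray(row_len)  # the row above the first row is all zeros
    for start in range(0, len(data), stride):
        filter_type = data[start]
        row = bytearray(data[start + 1:start + stride])
        for i in range(row_len):
            # row[i - bpp] is already decoded because we decode in place.
            left = row[i - bpp] if i >= bpp else 0
            up = prev[i]
            up_left = prev[i - bpp] if i >= bpp else 0
            if filter_type == 0:    # None
                pred = 0
            elif filter_type == 1:  # Sub
                pred = left
            elif filter_type == 2:  # Up
                pred = up
            elif filter_type == 3:  # Average
                pred = (left + up) // 2
            elif filter_type == 4:  # Paeth
                pred = paeth(left, up, up_left)
            else:
                raise ValueError(f"unknown PNG filter type: {filter_type}")
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)
```

With the parameters from the stream above (`Colors` 4, `BitsPerComponent` 4), each pixel occupies 2 bytes, so the "left" neighbour of a byte is the byte two positions earlier.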

manuGil commented 1 year ago

For now, I will avoid saving this type of image.
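
A rough sketch of what skipping could look like in the pipeline, assuming the writer object exposes the `export_image` method seen in the traceback. The wrapper name `export_or_skip` is hypothetical, and catching both `ValueError` and `TypeError` is an assumption based on the errors reported in this issue.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def export_or_skip(writer: Any, image: Any) -> Optional[str]:
    """Try to export an image; return None if it cannot be decoded.

    `writer` is any object with an `export_image(image)` method
    (an assumption based on the traceback in this issue).
    """
    try:
        return writer.export_image(image)
    except (ValueError, TypeError) as err:
        # Unsupported bit depths surface as one of these two errors.
        logger.warning("Skipping image %r: %s", getattr(image, "name", None), err)
        return None
```

The caller can then filter out the `None` results and continue with the remaining images instead of aborting the whole run.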

manuGil commented 1 year ago

A related error: TypeError: No decoding for image with 4 bits and 1 channels. It is also caused by the lack of a decoder for this type of image. The solution is to write a custom image decoder. PDFMiner supports any decoder that PIL also supports.
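
One possible starting point for such a decoder is to expand the packed 4-bit samples to 8 bits per sample. This is only a sketch, not pdfminer or PIL code. The function name `unpack_4bit_samples` is hypothetical, and it assumes the data is already decompressed and unfiltered, with the high nibble first and each row padded to a whole byte.

```python
def unpack_4bit_samples(data: bytes, width: int, height: int, channels: int) -> bytes:
    """Expand packed 4-bit samples to one byte per sample (hypothetical helper)."""
    samples_per_row = width * channels
    # Rows are assumed to be padded to a byte boundary.
    row_len = (samples_per_row * 4 + 7) // 8
    if len(data) < row_len * height:
        raise ValueError("not enough data for the given dimensions")

    out = bytearray()
    for y in range(height):
        row = data[y * row_len:(y + 1) * row_len]
        samples = []
        for byte in row:
            samples.append(byte >> 4)    # high nibble comes first
            samples.append(byte & 0x0F)  # then the low nibble
        # Drop the padding nibble, if any, and scale 0..15 to 0..255.
        out.extend(s * 17 for s in samples[:samples_per_row])
    return bytes(out)
```

Assuming Pillow is installed, the resulting bytes could then be passed to `Image.frombytes` with mode `"L"` for one channel or `"CMYK"` for four channels.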