euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

ValueError raised on parsing some PDF (apply_png_predictor) #120

Open damienfa opened 9 years ago

damienfa commented 9 years ago

Parsing fails for some PDFs. I get the following error:

File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pdfminer/utils.py", line 47, in apply_png_predictor
    raise ValueError(ft)
ValueError: 2

(If needed, I can send you the PDF that fails.)

I suppose that this kind of PNG compression is not supported. Is that possible?
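
For context, a hedged reading of the error: assuming the number in `ValueError: 2` is the filter-type byte that the PNG specification places in front of each row of data, the message names a row filter rather than a compression scheme. The lookup below is only an illustrative sketch; `PNG_FILTER_TYPES` and `describe_filter_type` are hypothetical names and are not part of pdfminer.

```python
# Filter types defined by the PNG specification (one byte before each row).
PNG_FILTER_TYPES = {
    0: "None",
    1: "Sub",
    2: "Up",
    3: "Average",
    4: "Paeth",
}


def describe_filter_type(ft):
    """Return a readable name for a PNG filter-type byte (hypothetical helper)."""
    return PNG_FILTER_TYPES.get(ft, "unknown filter type %d" % ft)


# The value reported in this issue.
print(describe_filter_type(2))
```

Under that assumption, the failing rows would be using the "Up" filter, which predicts each byte from the byte directly above it in the previous row.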

SigmaX commented 4 years ago

I think I've encountered the same bug when processing certain PDFs. Looking for a workaround.

My exception is raised at line 46 of pdfminer/utils.py, in apply_png_predictor():

ValueError: Unsupported predictor value: 2

Unfortunately all the PDFs I have that trigger this error contain sensitive data, so I cannot share them. I would welcome hypotheses about what the root cause could be!
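
One hypothesis, assuming the value in the message is the PNG filter-type byte of a row: the installed `apply_png_predictor()` handles only some of the five filter types in the PNG specification and raises on the rest. The sketch below shows what reversing all five filters could look like for 8-bit components. It is not pdfminer's implementation; `undo_png_filters` and `paeth` are hypothetical names, and the argument names only mirror the `Colors` and `Columns` parameters seen in the traceback.

```python
def paeth(a, b, c):
    """Paeth predictor from the PNG spec: pick the neighbor closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_filters(data, columns, colors=1):
    """Reverse PNG row filters for 8-bit components (illustrative sketch)."""
    bpp = colors                  # bytes per pixel when each component is 8 bits
    row_len = colors * columns    # bytes per row, excluding the filter-type byte
    prev = bytearray(row_len)     # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(data), row_len + 1):
        ft = data[start]
        if ft > 4:
            raise ValueError("Unsupported predictor value: %d" % ft)
        raw = data[start + 1:start + 1 + row_len]
        cur = bytearray(len(raw))
        for i, x in enumerate(raw):
            a = cur[i - bpp] if i >= bpp else 0    # left neighbor
            b = prev[i]                            # neighbor above
            c = prev[i - bpp] if i >= bpp else 0   # upper-left neighbor
            if ft == 0:
                pred = 0                 # None
            elif ft == 1:
                pred = a                 # Sub
            elif ft == 2:
                pred = b                 # Up
            elif ft == 3:
                pred = (a + b) // 2      # Average
            else:
                pred = paeth(a, b, c)    # Paeth
            cur[i] = (x + pred) & 0xFF
        out += cur
        prev = cur
    return bytes(out)


# Two rows of three bytes: the first unfiltered, the second using "Up".
decoded = undo_png_filters(bytes([0, 1, 2, 3, 2, 1, 1, 1]), columns=3)
```

If this hypothesis holds, a file whose cross-reference stream uses any filter type the installed version does not implement would fail at exactly this point, regardless of the rest of its content.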


Version

Using the latest and greatest:

$ pip3 show pdfminer
Name: pdfminer
Version: 20191020
Summary: PDF parser and analyzer
Home-page: http://github.com/euske/pdfminer
Author: Yusuke Shinyama
Author-email: yusuke@shinyama.jp
License: MIT
Location: /var/tmp/.local/lib/python3.6/site-packages
Requires: pycryptodome

MWE

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

test_file = '/path/to/problem/file.pdf'

with open(test_file, 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, fallback=True)

Full Trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-13c569fd239f> in <module>()
      4 with open(test_file, 'rb') as fp:
      5     parser = PDFParser(fp)
----> 6     document = PDFDocument(parser, fallback=True)

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
    556         try:
    557             pos = self.find_xref(parser)
--> 558             self.read_xref_from(parser, pos, self.xrefs)
    559         except PDFNoValidXRef:
    560             fallback = True

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in read_xref_from(self, parser, start, xrefs)
    787             parser.reset()
    788             xref = PDFXRefStream()
--> 789             xref.load(parser)
    790         else:
    791             if token is parser.KEYWORD_XREF:

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in load(self, parser)
    240         self.ranges.extend(choplist(2, index_array))
    241         (self.fl1, self.fl2, self.fl3) = stream['W']
--> 242         self.data = stream.get_data()
    243         self.entlen = self.fl1+self.fl2+self.fl3
    244         self.trailer = stream.attrs

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdftypes.py in get_data(self)
    290     def get_data(self):
    291         if self.data is None:
--> 292             self.decode()
    293         return self.data
    294 

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdftypes.py in decode(self)
    281                     columns = int_value(params.get('Columns', 1))
    282                     bitspercomponent = int_value(params.get('BitsPerComponent', 8))
--> 283                     data = apply_png_predictor(pred, colors, columns, bitspercomponent, data)
    284                 else:
    285                     raise PDFNotImplementedError('Unsupported predictor: %r' % pred)

/var/tmp/.local/lib/python3.6/site-packages/pdfminer/utils.py in apply_png_predictor(pred, colors, columns, bitspercomponent, data)
     44         else:
     45             # unsupported
---> 46             raise ValueError("Unsupported predictor value: %d"%ft)
     47         buf += line2
     48         line0 = line2

ValueError: Unsupported predictor value: 2

euske commented 4 years ago

Thanks for reporting! The patch by @naren8642 and the later fix should address this. Can you try the latest version and see how it works?

jafudi commented 4 years ago

Dear Yusuke,

I am a big fan of pdfminer, thanks for creating it!

Currently, I ask myself the question: With Python3.8 in my development environment, which repo should I be pulling from? euske/pdfminer (https://github.com/euske/pdfminer) or pdfminer/pdfminer.six (https://github.com/pdfminer/pdfminer.six)?

I was under the impression that before only the fork based on the six package was functioning with Python 3. Are you now upgrading the original repository and should I be switching back?

Thanks for some guidance, Jens

euske commented 4 years ago

Dear Yusuke, I am a big fan of pdfminer, thanks for creating it! Currently, I ask myself the question: With Python3.8 in my development environment, which repo should I be pulling from?

As you said, pdfminer.six is a fork that supports both Python2 and Python3. They might have added other improvements, but I really don't know much about it, as I haven't paid much attention to it. (I didn't join the development because I didn't want to write code that works on both Python2 and Python3. In my opinion they are different languages and should be treated separately.)

As for the original pdfminer, its basic functionality hasn't changed since 2014. The recent changes are only for switching to Python3, which was done independently from pdfminer.six. But this is more or less an emergency measure (as Python2 is being phased out) and I don't have plans to add new functionality to it. The original pdfminer will probably stay as it is until Python4? comes out.

SigmaX commented 4 years ago

Looks like the patch from @naren8642 resolved my issue—I'm now able to parse the PDFs I was working with.

Thanks so much to everyone for attending to this!

pietermarsman commented 4 years ago

I was under the impression that before only the fork based on the six package was functioning with Python 3. Are you now upgrading the original repository and should I be switching back?

Half a year ago I started working with pdfminer.six. I am using Python 3, and at that time it was the way to go. The package is tremendously useful for reading text from PDFs! @euske, thanks for all the work you've put into it!

A couple of months ago I also started contributing to pdfminer.six. It is maintained by a community that is getting more and more active. A list of fixed bugs and new features from the last years is in the changelog.

We are planning on continuously improving pdfminer.six even more in the next months/years. One look at the issue list is enough to see that there are a lot of corner cases that produce bugs.

Starting in January 2020, pdfminer.six will also drop Python 2 support.

jafudi commented 4 years ago

Hi Yusuke,

I just managed to switch back to using your original pdfminer with Python 3. May I suggest that you include the high-level API from https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/high_level.py in your package? It made getting started much easier, at least for me. I imagine that others would also appreciate being able to use your package with a less steep learning curve.

Best regards, Jens
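
For readers who have not seen the high-level API, here is a minimal sketch of the kind of call it allows. It assumes a pdfminer.six release whose `pdfminer.high_level` module provides `extract_text` (worth checking against the installed version); the wrapper name `pdf_to_text` is hypothetical.

```python
def pdf_to_text(path):
    """Return the text of the PDF at `path` via pdfminer.six's high-level API."""
    # Imported inside the function so this sketch can be loaded even where
    # pdfminer.six is not installed (an assumption made for illustration).
    from pdfminer.high_level import extract_text

    return extract_text(path)


# Example call (the path is a placeholder):
# text = pdf_to_text("/path/to/file.pdf")
```

Compared with building a PDFParser and PDFDocument by hand, as in the MWE earlier in this thread, the whole extraction is a single call.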

jafudi commented 4 years ago

If you agree, I would be prepared to open a pull request.

alireza-em commented 3 years ago

Hi, I am running into the same problem you had when reading certain PDFs. Can you help me apply the patch from @naren8642?

I am getting the following error when reading some PDFs with the pdfplumber module, which is based on pdfminer.

Traceback (most recent call last):

  File "<ipython-input-58-eb095c216011>", line 1, in <module>
    pdf  = pdfplumber.open("Folder1/pdffile.pdf")

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfplumber\pdf.py", line 46, in open
    return cls(open(path_or_fp, "rb"), **kwargs)

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfplumber\pdf.py", line 25, in __init__
    self.doc = PDFDocument(PDFParser(stream), password=password)

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 548, in __init__
    self.read_xref_from(parser, pos, self.xrefs)

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 802, in read_xref_from
    self.read_xref_from(parser, pos, xrefs)

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 791, in read_xref_from
    xref.load(parser)

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 228, in load
    self.data = stream.get_data()

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdftypes.py", line 319, in get_data
    self.decode()

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdftypes.py", line 308, in decode
    data = apply_png_predictor(pred, colors, columns,

  File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\utils.py", line 108, in apply_png_predictor
    raise ValueError("Unsupported predictor value: %d" % ft)

ValueError: Unsupported predictor value: 4

Thanks.
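
As a stopgap while the installed pdfminer rejects this filter type, one option is to skip the affected files rather than abort a whole batch. This is a minimal sketch, assuming the failure always surfaces as the `ValueError` shown in the traceback; `open_or_skip` is a hypothetical helper, and `opener` stands for any callable that opens a PDF, such as `pdfplumber.open`.

```python
def open_or_skip(opener, path):
    """Call opener(path); return None if the PNG predictor is unsupported."""
    try:
        return opener(path)
    except ValueError as exc:
        # Swallow only the specific error from this thread; re-raise anything else.
        if "Unsupported predictor value" in str(exc):
            print("Skipping %s: %s" % (path, exc))
            return None
        raise


# Example call (assumes pdfplumber is installed; the path is a placeholder):
# pdf = open_or_skip(pdfplumber.open, "Folder1/pdffile.pdf")
```

This does not fix the parsing itself; it only keeps a batch job running and records which files need a version that supports the filter type.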