Open damienfa opened 9 years ago
I think I've encountered the same bug when processing certain PDFs. Looking for a workaround.
My exception is raised at line 46 of pdfminer/utils.py, i.e. in apply_png_predictor():
ValueError: Unsupported predictor value: 2
Unfortunately all the PDFs I have that trigger this error contain sensitive data, so I cannot share them. I would welcome hypotheses about what the root cause could be!
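One way to narrow this down without sharing the files: going by the traceback, the number in the message appears to be the filter-type byte that prefixes each row of PNG-predicted data (an assumption read off the `%ft` in the raise statement, not confirmed by the maintainers). The PNG specification defines five such types, and 2 is "Up". The sketch below is not pdfminer code; `row_filter_types` is a hypothetical helper that counts which filter types occur in a stream that has been Flate-decompressed but not yet un-predicted.

```python
from collections import Counter

# Filter types as defined by the PNG specification.
PNG_FILTER_NAMES = {0: "None", 1: "Sub", 2: "Up", 3: "Average", 4: "Paeth"}


def row_filter_types(data, colors=1, columns=1, bitspercomponent=8):
    """Count the filter-type byte that prefixes each row.

    `data` is the stream content after decompression and before the
    predictor is undone. The defaults mirror the PDF defaults for
    /Colors, /Columns and /BitsPerComponent.
    """
    # Bytes per row, rounded up to a whole byte.
    rowbytes = (colors * columns * bitspercomponent + 7) // 8
    counts = Counter()
    # Each row is one filter byte followed by `rowbytes` data bytes.
    for start in range(0, len(data), rowbytes + 1):
        ft = data[start]  # indexing bytes yields an int on Python 3
        counts[PNG_FILTER_NAMES.get(ft, "unknown(%d)" % ft)] += 1
    return counts
```

Reporting only these counts (for example "every row uses Up") reveals nothing about the document content and may be enough to reproduce the failure with synthetic data.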
Using the latest and greatest:
$ pip3 show pdfminer
Name: pdfminer
Version: 20191020
Summary: PDF parser and analyzer
Home-page: http://github.com/euske/pdfminer
Author: Yusuke Shinyama
Author-email: yusuke@shinyama.jp
License: MIT
Location: /var/tmp/.local/lib/python3.6/site-packages
Requires: pycryptodome
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

test_file = '/path/to/problem/file.pdf'
with open(test_file, 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, fallback=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-13c569fd239f> in <module>()
4 with open(test_file, 'rb') as fp:
5 parser = PDFParser(fp)
----> 6 document = PDFDocument(parser, fallback=True)
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
556 try:
557 pos = self.find_xref(parser)
--> 558 self.read_xref_from(parser, pos, self.xrefs)
559 except PDFNoValidXRef:
560 fallback = True
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in read_xref_from(self, parser, start, xrefs)
787 parser.reset()
788 xref = PDFXRefStream()
--> 789 xref.load(parser)
790 else:
791 if token is parser.KEYWORD_XREF:
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdfdocument.py in load(self, parser)
240 self.ranges.extend(choplist(2, index_array))
241 (self.fl1, self.fl2, self.fl3) = stream['W']
--> 242 self.data = stream.get_data()
243 self.entlen = self.fl1+self.fl2+self.fl3
244 self.trailer = stream.attrs
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdftypes.py in get_data(self)
290 def get_data(self):
291 if self.data is None:
--> 292 self.decode()
293 return self.data
294
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/pdftypes.py in decode(self)
281 columns = int_value(params.get('Columns', 1))
282 bitspercomponent = int_value(params.get('BitsPerComponent', 8))
--> 283 data = apply_png_predictor(pred, colors, columns, bitspercomponent, data)
284 else:
285 raise PDFNotImplementedError('Unsupported predictor: %r' % pred)
/var/tmp/.local/lib/python3.6/site-packages/pdfminer/utils.py in apply_png_predictor(pred, colors, columns, bitspercomponent, data)
44 else:
45 # unsupported
---> 46 raise ValueError("Unsupported predictor value: %d"%ft)
47 buf += line2
48 line0 = line2
ValueError: Unsupported predictor value: 2
Thanks for reporting! The patch by @naren8642 and the later fix should address this. Can you try the latest version and see how it works?
Dear Yusuke,
I am a big fan of pdfminer, thanks for creating it!
Currently, I ask myself the question: With Python3.8 in my development environment, which repo should I be pulling from? euske/pdfminer https://github.com/euske/pdfminer or pdfminer/pdfminer.six https://github.com/pdfminer/pdfminer.six
I was under the impression that before only the fork based on the six package was functioning with Python 3. Are you now upgrading the original repository and should I be switching back?
Thanks for some guidance, Jens
Dear Yusuke, I am a big fan of pdfminer, thanks for creating it! Currently, I ask myself the question: With Python3.8 in my development environment, which repo should I be pulling from?
As you said, pdfminer.six is a fork that supports both Python 2 and Python 3. They might have added other improvements, but I really don't know much about it, as I haven't paid much attention to it. (I didn't join the development because I didn't want to write code that works on both Python 2 and Python 3. In my opinion they are different languages and should be treated separately.)
As for the original pdfminer, its basic functionality hasn't changed since 2014. The recent changes are only for switching to Python 3, which was done independently of pdfminer.six. But this is more or less an emergency treatment (as Python 2 is being phased out) and I have no real plans to add any functionality to it. The original pdfminer will probably stay as it is until Python4? comes out.
Looks like the patch from @naren8642 resolved my issue—I'm now able to parse the PDFs I was working with.
Thanks so much to everyone for attending to this!
I was under the impression that before only the fork based on the six package was functioning with Python 3. Are you now upgrading the original repository and should I be switching back?
Half a year ago I started working with pdfminer.six. I am using Python 3 and at that time it was the way to go. The package is tremendously useful for reading text from PDFs! @euske, thanks for all the work you've put into it!
A couple of months ago I also started contributing to pdfminer.six. It is maintained by a community that is getting more and more active. A list of fixed bugs and new features from the last few years is in the changelog.
We are planning to keep improving pdfminer.six over the next months and years. One look at the issue list is enough to see that there are a lot of corner cases that produce bugs.
Starting in January 2020, pdfminer.six will also drop Python 2 support.
Hi Yusuke,
I just managed to switch back to using your original pdfminer with Python 3. May I suggest that you include the high-level API from https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/high_level.py in your package? It made getting started much easier, at least for me. I could imagine that others would also appreciate being able to use your package with a less steep learning curve.
Best regards, Jens
If you agree, I would be prepared to open a pull request.
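For readers who have not seen the high-level API referred to above, here is a minimal sketch of how it is typically used. It assumes pdfminer.six (not the original pdfminer) is installed; `pdf_to_text` is a hypothetical wrapper name, and the import is placed inside the function so that the snippet can be loaded even where the package is missing.

```python
def pdf_to_text(path):
    """Return the text of a PDF using the pdfminer.six high-level API."""
    # Assumption: pdfminer.six is installed; the original pdfminer
    # discussed in this thread does not ship this module.
    from pdfminer.high_level import extract_text

    return extract_text(path)


# Usage (path is a placeholder):
# print(pdf_to_text("/path/to/file.pdf"))
```

The single call replaces the manual wiring of PDFParser, PDFDocument, a resource manager and an interpreter, which is the learning-curve point being made.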
Hi, I am running into the same problem that you had reading certain PDFs. Can you help me with how to use the patch from @naren8642?
I am getting the following error when reading some PDFs with the pdfplumber module, which is based on pdfminer.
Traceback (most recent call last):
File "<ipython-input-58-eb095c216011>", line 1, in <module>
pdf = pdfplumber.open("Folder1/pdffile.pdf")
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfplumber\pdf.py", line 46, in open
return cls(open(path_or_fp, "rb"), **kwargs)
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfplumber\pdf.py", line 25, in __init__
self.doc = PDFDocument(PDFParser(stream), password=password)
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 548, in __init__
self.read_xref_from(parser, pos, self.xrefs)
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 802, in read_xref_from
self.read_xref_from(parser, pos, xrefs)
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 791, in read_xref_from
xref.load(parser)
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 228, in load
self.data = stream.get_data()
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdftypes.py", line 319, in get_data
self.decode()
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\pdftypes.py", line 308, in decode
data = apply_png_predictor(pred, colors, columns,
File "C:\Users\Alireza\anaconda3\lib\site-packages\pdfminer\utils.py", line 108, in apply_png_predictor
raise ValueError("Unsupported predictor value: %d" % ft)
ValueError: Unsupported predictor value: 4
Thanks.
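For reference, if the number in this message is the per-row PNG filter-type byte (an assumption based on the traceback), then 4 corresponds to the Paeth filter in the PNG specification. The sketch below is not the patch from this thread and not pdfminer's implementation; `undo_png_predictor` and `paeth` are hypothetical names showing what a decoder has to do to cover all five filter types, restricted to 8 bits per component.

```python
def paeth(a, b, c):
    """Paeth predictor: return whichever of a, b, c is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_predictor(data, colors=1, columns=1, bitspercomponent=8):
    """Reverse PNG row filters 0-4 on already-decompressed data."""
    if bitspercomponent != 8:
        raise ValueError("this sketch only handles 8 bits per component")
    bpp = colors                # bytes per pixel at 8 bits per component
    rowbytes = colors * columns
    prev = bytes(rowbytes)      # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(data), rowbytes + 1):
        ft = data[start]        # filter-type byte for this row
        raw = data[start + 1:start + 1 + rowbytes]
        row = bytearray(len(raw))
        for i, x in enumerate(raw):
            a = row[i - bpp] if i >= bpp else 0    # left neighbour
            b = prev[i]                            # neighbour above
            c = prev[i - bpp] if i >= bpp else 0   # upper-left neighbour
            if ft == 0:
                pred = 0                 # None
            elif ft == 1:
                pred = a                 # Sub
            elif ft == 2:
                pred = b                 # Up
            elif ft == 3:
                pred = (a + b) // 2      # Average
            elif ft == 4:
                pred = paeth(a, b, c)    # Paeth
            else:
                raise ValueError("Unsupported filter type: %d" % ft)
            row[i] = (x + pred) & 0xFF   # arithmetic is modulo 256
        out += row
        prev = bytes(row)
    return bytes(out)
```

A decoder that stops at filter type 3 would fail on exactly the rows that use Paeth, which would be consistent with the message above.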
Parsing fails for some PDFs. I get the following error:
(if needed, I can send you the PDFs that fail)
I suppose that the PNG compression is not handled. Is that possible?