clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

PDF parser not working #101

Open timnugent opened 9 years ago

timnugent commented 9 years ago

I tried the PDF download/parsing example here: http://www.clips.ua.ac.be/pages/pattern-web#pdf

But ran into this issue:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pattern.web import URL, PDF
>>> url = URL('http://www.clips.ua.ac.be/sites/default/files/ctrs-002_0.pdf')
>>> pdf = PDF(url.download())
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/Pattern-2.6-py2.7.egg/pattern/web/__init__.py", line 3775, in __init__
    self.content = self._parse(path, format=output)
  File "/usr/local/lib/python2.7/dist-packages/Pattern-2.6-py2.7.egg/pattern/web/__init__.py", line 3790, in _parse
    raise PDFError(str(e))
pattern.web.PDFError: must be encoded string without NULL bytes, not str

Using latest version from Git under Ubuntu 14.04.

Cheers, Tim

bsmartt13 commented 9 years ago

Similar error on OS X 10.10.1

Python 2.7.6 (default, Sep  9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
>>> from pattern.web import URL, PDF
>>> url = URL('http://www.clips.ua.ac.be/sites/default/files/ctrs-002_0.pdf')
>>> pdf = PDF(url.download())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/bsmartt/.virt_env/lib/python2.7/site-packages/pattern/web/__init__.py", line 3612, in __init__
    self.content = self._parse(path, format=output)
  File "/Users/bsmartt/.virt_env/lib/python2.7/site-packages/pattern/web/__init__.py", line 3625, in _parse
    process_pdf(m, p, self._open(path), set(), maxpages=0, password="")
  File "/Users/bsmartt/.virt_env/lib/python2.7/site-packages/pattern/web/__init__.py", line 3585, in _open
    if isinstance(path, basestring) and os.path.exists(path):
  File "/Users/bsmartt/.virt_env/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: must be encoded string without NULL bytes, not str

dn11 commented 9 years ago

Similar error on Windows 8


PDFError                                  Traceback (most recent call last)
 in <module>()
----> 1 pdf = PDF(url.download())

C:\Users\dgn2\AppData\Local\Enthought\Canopy\User\lib\site-packages\pattern\web\__init__.pyc in __init__(self, path, output)
   3610
   3611     def __init__(self, path, output="txt"):
-> 3612         self.content = self._parse(path, format=output)
   3613
   3614     def _parse(self, path, *args, **kwargs):

C:\Users\dgn2\AppData\Local\Enthought\Canopy\User\lib\site-packages\pattern\web\__init__.pyc in _parse(self, path, *args, **kwargs)
   3625             process_pdf(m, p, self._open(path), set(), maxpages=0, password="")
   3626         except Exception, e:
-> 3627             raise PDFError, str(e)
   3628         s = s.getvalue()
   3629         s = decode_utf8(s)

PDFError: must be encoded string without NULL bytes, not str

In [206]: pdf = PDF(url.download())

Abhilash-D commented 9 years ago

@Tim, just do pdf = PDF(url.download(unicode=True)). The encoding is the issue here.

timnugent commented 9 years ago

That didn't solve it I'm afraid:

pattern.web.PDFError: must be encoded string without NULL bytes, not unicode
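
The two messages differ only in the type name at the end (str vs. unicode), and the longer traceback earlier in the thread shows the exception coming from os.stat(path) inside os.path.exists. That suggests the downloaded PDF content itself, which contains NULL bytes, is being tested as if it were a filename, so switching between str and unicode would not be expected to change the outcome. A minimal sketch of that behaviour, using only the standard library; the helper name `stat_rejects` and the sample bytes are made up for illustration, and the exception type depends on the interpreter (TypeError on Python 2.7, ValueError on Python 3):

```python
import os

def stat_rejects(path):
    """Return True if os.stat refuses `path` because of an embedded NULL byte."""
    try:
        os.stat(path)
    except (TypeError, ValueError):
        # Python 2.7 raises TypeError here, Python 3 raises ValueError.
        return True
    except OSError:
        # An ordinary "no such file" failure: the string was a usable path.
        return False
    return False

# Stand-in for downloaded PDF data (made-up bytes, not the real file).
fake_pdf = b"%PDF-1.4\x00binary stream"
print(stat_rejects(fake_pdf))                    # byte string
print(stat_rejects(fake_pdf.decode("latin-1")))  # text (unicode) string
```

If both calls report a rejection, the problem lies with the NULL bytes in the content rather than with the string type.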

Abhilash-D commented 9 years ago

Oh. It worked for me after I set unicode=True. No idea what the issue would be then.

Leanwit commented 8 years ago

Hi, I get the same problem with PDF, but the error is the following: "must be encoded string without NULL bytes, not unicode"
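
Since the traceback earlier in the thread points at an os.path.exists(path) check on whatever is passed to PDF, one possible workaround is to write the download to a temporary file and pass the filename instead of the raw content. This is an untested sketch, not a confirmed fix: it assumes PDF accepts a path to an existing file (which that check suggests but does not prove), and the helper name `save_to_tempfile` is made up for illustration.

```python
import os
import tempfile

def save_to_tempfile(data, suffix=".pdf"):
    """Write raw bytes to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

# Hypothetical usage with the names from this thread (not run here):
#   path = save_to_tempfile(url.download())
#   pdf = PDF(path)
#   os.remove(path)  # clean up once the file has been parsed
```

The temporary file is not deleted automatically, so the caller is responsible for removing it.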