alisafaya / txt-from-pdf

Extracting clean text from pdfs using pdfminer.six and pypdf.
Apache License 2.0

CLI program gives lots of errors. #2

Open bulrush15 opened 4 months ago

bulrush15 commented 4 months ago

I'm on Windows 10 with Python 3.12.

The CLI program example gives many errors:

INFO:txtfrompdf.__main__:Starting extraction with txt-from-pdf
INFO:txtfrompdf.__main__:Input path: fedex-small50pg.pdf
INFO:txtfrompdf.__main__:Found 1 PDF files
INFO:txtfrompdf.__main__:Extracting: fedex-small50pg.pdf
WARNING:pypdf._reader:Overwriting cache for 0 171
WARNING:pypdf._reader:Overwriting cache for 0 171
WARNING:pypdf.generic._data_structures:PdfReadError("Invalid Elementary Object starting with b's' @556682: b'y\\xf1\\xb8s/\\x1e\\xd6_\\xb8\\xf8\\x1b\\x9b\\xfcB\\xaf\\r\\nendstream\\rendobj\\r17 0 obj\\r<</Contents 18 0 R/CropBox[0.0 0.0 61'")
WARNING:pypdf._reader:Overwriting cache for 0 116
ERROR:txtfrompdf.extract:Generated an exception: 'Length1' for page C:\users\USERNAME\AppData\Local\Temp\tmp6cwjg9ak\18.pdf
Traceback (most recent call last):
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\utils.py", line 21, in temp_directory
    yield temp_dir
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\extract.py", line 117, in _extract_txt_from_pdf
    raise exc
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\extract.py", line 113, in _extract_txt_from_pdf
    texts[page] = future.result()
                  ^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\extract.py", line 84, in pdf_to_text
    interpreter.process_page(page)
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdfinterp.py", line 384, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdfinterp.py", line 216, in get_font
    font = PDFType1Font(self, spec)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdffont.py", line 1009, in __init__
    length1 = int_value(self.fontfile["Length1"])
                        ~~~~~~~~~~~~~^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\pdfminer\pdftypes.py", line 285, in __getitem__
    return self.attrs[name]
           ~~~~~~~~~~^^^^^^
KeyError: 'Length1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Scripts\txt-from-pdf.exe\__main__.py", line 7, in <module>
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\__main__.py", line 68, in cli_main
    text = extract_txt_from_pdf(pdf, process_output=not args.no_filter)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\extract.py", line 144, in extract_txt_from_pdf
    text = _extract_txt_from_pdf(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\extract.py", line 101, in _extract_txt_from_pdf
    with temp_directory() as temp_dir:
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "C:\users\USERNAME\OneDrive - CONAME\Documents\PythonProjects\CONAME\Test\Fedexpdf\.venv\Lib\site-packages\txtfrompdf\utils.py", line 23, in temp_directory
    shutil.rmtree(temp_dir)
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\shutil.py", line 808, in rmtree
    return _rmtree_unsafe(path, onexc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\shutil.py", line 636, in _rmtree_unsafe
    onexc(os.unlink, fullname, err)
  File "C:\users\USERNAME\AppData\Local\Programs\Python\Python312\Lib\shutil.py", line 634, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\chuck\\AppData\\Local\\Temp\\tmp6cwjg9ak\\13.pdf'

alisafaya commented 4 months ago

This has not been tested on Windows. Can you try calling it from Python this way instead:

from txtfrompdf import extract_txt_from_pdf

pdffile = "some_pdf_file.pdf"
text = extract_txt_from_pdf(pdffile, split_into_pages=False)
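
A note on the second traceback above: the PermissionError is raised by shutil.rmtree while the temp_directory context manager is cleaning up after the KeyError, and Windows refuses to delete a file that is still open (here 13.pdf, presumably still held by another worker thread). A minimal sketch of a more tolerant cleanup follows. It is not the actual code in txtfrompdf/utils.py, and the name temp_directory_tolerant is hypothetical. It only shows how best-effort removal would let the original exception surface instead of being followed by the cleanup error.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temp_directory_tolerant():
    """Hypothetical helper: yield a fresh temp dir, then remove it best-effort."""
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # ignore_errors=True skips files that cannot be deleted (for example,
        # files still open on Windows) instead of raising PermissionError,
        # so an exception raised inside the with-block propagates unchanged.
        shutil.rmtree(temp_dir, ignore_errors=True)


if __name__ == "__main__":
    with temp_directory_tolerant() as d:
        print(os.path.isdir(d))
```

The trade-off is that files that could not be removed are silently left behind in the temp folder. The underlying KeyError: 'Length1' raised inside pdfminer is a separate problem that this sketch does not address.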