getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.58k stars, 358 forks

NoneType for images in pdf.py #72

Open HuyLe82US opened 1 month ago

HuyLe82US commented 1 month ago

When I tried to OCR a .pdf file, I got this error. Here is the log:

ERROR:root:Error converting PDF to images: Unable to get page count. Is poppler installed and in PATH?
Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 149, in zerox
    results = await process_pages_in_batches(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\processor\pdf.py", line 104, in process_pages_in_batches
    for image in images
                 ^^^^^^
TypeError: 'NoneType' object is not iterable

I have already installed poppler-utils, and I also checked that the package is present in the project.
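Read together, the log line and the traceback suggest that the PDF-to-image step logged the Poppler error and returned None instead of raising, so the later `for image in images` loop failed with a TypeError. The sketch below reproduces that pattern with hypothetical stand-in functions (`convert_pdf_to_images` and `iterate_pages` are illustrative names, not zerox's actual code) and shows a guard that surfaces the earlier failure.

```python
import logging
from typing import List, Optional


def convert_pdf_to_images(pdf_path: str) -> Optional[List[str]]:
    """Hypothetical stand-in: logs the failure and returns None instead of raising."""
    try:
        # Simulates the converter failing because Poppler cannot be found.
        raise RuntimeError("Unable to get page count. Is poppler installed and in PATH?")
    except Exception as err:
        logging.error("Error converting PDF to images: %s", err)
        return None


def iterate_pages(images: Optional[List[str]]) -> List[str]:
    """Guarded loop: fail with a clear message rather than a TypeError."""
    if images is None:
        raise RuntimeError("PDF conversion returned no images; check the earlier log line")
    return [image for image in images]
```

In other words, the TypeError is a symptom; the line to act on is the "Unable to get page count" message logged just before it.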

HuyLe82US commented 1 month ago

I found out the cause and here is the solution from ChatGPT:

Steps to Resolve

  1. Install Poppler:

    • Poppler is required for converting PDF pages into images. You need to install it on your system.

    On Windows:

    • Download the Poppler binaries from Poppler for Windows.
    • Extract the zip file to a folder (e.g., C:\poppler).

    On macOS:

    • You can install Poppler via Homebrew:
      brew install poppler

    On Linux (Debian/Ubuntu):

    • Install Poppler using the package manager:
      sudo apt-get install poppler-utils

  2. Add Poppler to System PATH:

    If you're on Windows, you'll need to add the bin folder from the Poppler installation to your system's PATH.

    Adding Poppler to PATH (Windows):

    1. Right-click on This PC or My Computer and go to Properties.
    2. Click on Advanced system settings.
    3. In the System Properties window, click on the Environment Variables button.
    4. Under System variables, find the Path variable, and click Edit.
    5. Click New and add the path to the Poppler bin directory (e.g., C:\poppler\bin).
    6. Click OK to close all the windows.

  3. Verify Poppler Installation:

    After installing Poppler and adding it to the PATH, verify that it’s correctly set up by running the following command in your terminal (command prompt or shell):

    pdftoppm -h

    This should display help information for pdftoppm, one of the tools included with Poppler. If you see this, Poppler is correctly installed and added to the PATH.

  4. Retry Running Your Script:

    After ensuring Poppler is installed and available in the PATH, retry running your Python script. The error related to Poppler should be resolved.
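The PATH check in the steps above can also be done from inside Python, which shows what the interpreter itself sees rather than what a separate terminal sees. This is a minimal sketch: `find_poppler_tools` and `prepend_to_path` are hypothetical helper names, and `C:\poppler\bin` is just the example location used above.

```python
import os
import shutil
from typing import Dict, Optional

# Poppler command-line tools commonly needed for PDF-to-image conversion.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def find_poppler_tools() -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in POPPLER_TOOLS}


def prepend_to_path(directory: str) -> None:
    """Put `directory` first on PATH for this process and its children only."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")


if __name__ == "__main__":
    # Assumption: Poppler was extracted to C:\poppler, as in the example above.
    prepend_to_path(r"C:\poppler\bin")
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If the tools resolve in a fresh terminal but not here, the likely cause is that the IDE or terminal running the script was started before PATH was changed; environment variables are inherited at process start, so restarting it is usually enough.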

Additional Debugging:

If you still encounter issues, make sure:

HuyLe82US commented 1 month ago

After fixing that, I ran into another issue, this time with encoding:

Traceback (most recent call last):
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 22, in <module>
    asyncio.run(main())
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\testZerox.py", line 15, in main
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\pyzerox\core\zerox.py", line 169, in zerox
    await f.write("\n\n".join(aggregated_markdown))
  File "C:\Users\PycharmProjects\pythonProject\venv\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u1edc' in position 1: character maps to <undefined>

I've already set PYTHONIOENCODING=utf-8 in System Variables.
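The traceback ends in cp1252.py, which indicates the output file was opened with the Windows default locale encoding. PYTHONIOENCODING only applies to stdin, stdout, and stderr, so it would not change what a file opened for writing uses. The sketch below uses the built-in `open` (not aiofiles, and not zerox's own code) with a hypothetical output path, to show the failure and the effect of an explicit encoding.

```python
import os
import tempfile

text = "\u1edc"  # the character named in the traceback

# cp1252 has no mapping for this character, so encoding fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit encoding avoids depending on the locale default.
out_path = os.path.join(tempfile.mkdtemp(), "output.md")  # hypothetical output file
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back to confirm the character survived.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

Since the write happens inside the library, a user script cannot pass `encoding` there directly. One thing that may be worth trying is Python's UTF-8 mode (setting the environment variable PYTHONUTF8=1, or running with `-X utf8`), which makes UTF-8 the default for files opened without an explicit encoding.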

pradhyumna85 commented 4 weeks ago

@HuyLe82US, please don't follow the INSTRUCTIONS ON THE ABOVE LINK SHARED BY ummm288

@tylermaran, @annapo23 please block the previous comment, the link contains malware.

Also report the user.

Vamshi-Madineni commented 1 week ago

@HuyLe82US, did this issue get resolved? If not, could you try setting the errors='ignore' parameter when reading the PDF? This will skip any special characters that can't be encoded.
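For context on that suggestion: the second traceback fails while writing the Markdown output, not while reading the PDF, and `errors='ignore'` silently drops characters the codec cannot represent. The snippet below is only an illustration of that behavior on a plain string, using a sample word that starts with the character from the traceback.

```python
text = "\u1edcn"  # sample text beginning with the character from the traceback

# errors="ignore" drops the unencodable character without raising.
lossy = text.encode("cp1252", errors="ignore")
print(lossy)  # b'n'

# UTF-8 can represent the full string, so nothing is lost.
lossless = text.encode("utf-8")
print(lossless.decode("utf-8") == text)  # True
```

For Vietnamese or other non-Latin-1 text this would remove content from the output, so an encoding that can represent the text is likely the safer route.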