getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

ModuleNotFoundError for py-zerox Module #47

Open shawn8888 opened 1 month ago

shawn8888 commented 1 month ago

I'm encountering a ModuleNotFoundError when trying to import the py-zerox module in my Python project, despite having installed it successfully.

Environment

Python Version: 3.12
Installed Packages:

py-zerox                  0.0.3

Steps to Reproduce:

Install the py-zerox package using:

pip install py-zerox

Attempt to import the module in a Python script:

from pyzerox import zerox

Receive the following error:

ModuleNotFoundError: No module named 'pyzerox'
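
Note that the distribution name (py-zerox) and the import name (pyzerox) are different, so pip can report the package as installed while the module is still missing. A minimal sketch for checking both from the standard library follows; `describe_install` is a hypothetical helper name, not part of zerox.

```python
from importlib import metadata, util


def describe_install(dist_name: str, import_name: str) -> dict:
    """Report the installed version of a distribution and whether its module can be found."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # pip does not know this distribution
    # find_spec returns None when a top-level module cannot be located
    importable = util.find_spec(import_name) is not None
    return {
        "distribution": dist_name,
        "version": version,
        "module": import_name,
        "importable": importable,
    }


if __name__ == "__main__":
    print(describe_install("py-zerox", "pyzerox"))
```

A result with a version but `importable` set to False would match the situation described here.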

Could you please assist me in resolving this issue? Any guidance on ensuring that the py-zerox module is recognized would be greatly appreciated.

Thank you!

shawn8888 commented 1 month ago

Found a solution here: https://github.com/getomni-ai/zerox/pull/41

pip uninstall py-zerox
pip install git+https://github.com/getomni-ai/zerox.git

I created a .py file:

import os
from pyzerox import zerox
import asyncio

async def main():
    # Set your OpenAI API key
    os.environ["OPENAI_API_KEY"] = "mykey"

    # Path to the PDF file you want to process
    file_path = "PasnewB.PDF"

    # Call the zerox function
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")

    # Print the Markdown result
    print(result)

# Run the main function
asyncio.run(main())

The ModuleNotFoundError is fixed. However, I still got another error:

C:\Backup\Projects\python>python hello_zerox.py
Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox.py", line 19, in <module>
    asyncio.run(main())
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox.py", line 13, in main
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 91, in zerox
    select_pages = sorted(select_pages)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable

Please help! Thanks!
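
The traceback shows `sorted(select_pages)` being called while `select_pages` is None. A minimal sketch of the kind of guard that avoids this is below. `normalize_select_pages` is a hypothetical helper written for illustration; it is an assumption and not the actual change in any zerox PR.

```python
from typing import Iterable, List, Optional, Union


def normalize_select_pages(
    select_pages: Optional[Union[int, Iterable[int]]]
) -> Optional[List[int]]:
    """Return a sorted, de-duplicated page list, or None when no selection was given."""
    if select_pages is None:
        return None  # nothing selected: the caller should process every page
    if isinstance(select_pages, int):
        return [select_pages]  # a single page number
    return sorted(set(select_pages))
```

With this guard, a call that omits the page selection never reaches `sorted(None)`.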

pradhyumna85 commented 1 month ago

@shawn8888, the fix for the second problem has already been raised as PR #40, which is still unmerged. You can use it for now by uninstalling your py-zerox package and reinstalling with:

pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control

@tylermaran, @annapo23, could you please review PR #40 and merge it?

shawn8888 commented 1 month ago

@pradhyumna85 Thank you for your reply! I have uninstalled 0.0.5 and reinstalled 0.0.6. However, I got another error. I use the OpenAI API and the key looks fine to me.

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

ERROR:root:Failed to process image Error:
    Error in Completion Response. Error: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: output_dir', 'type': 'invalid_request_error', 'param': None, 'code': None}}
    Please check the status of your model provider API status.

ZeroxOutput(completion_time=2388.78, file_name='cs101', input_tokens=0, output_tokens=0, pages=[Page(content='', content_length=0, page=1)])

pradhyumna85 commented 1 month ago

@shawn8888, the output_dir parameter has been replaced with output_file_path, which is the path of the output .md file rather than a directory. Refer to: https://github.com/pradhyumna85/zerox/tree/formatting-control?tab=readme-ov-file#usage-1
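
For a script that previously passed a directory, one way to build the new file path is sketched below. `markdown_path_for` is a hypothetical helper, and naming the output after the PDF stem is an assumption made for illustration.

```python
from pathlib import Path


def markdown_path_for(pdf_path: str, output_dir: str) -> str:
    """Build a .md file path inside output_dir, named after the input PDF."""
    pdf = Path(pdf_path)
    return str(Path(output_dir) / f"{pdf.stem}.md")


if __name__ == "__main__":
    # File and directory names taken from the script earlier in this thread
    print(markdown_path_for("PasnewB.PDF", "./output"))
```

The returned string would then be passed as `output_file_path` in place of the old `output_dir` argument.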

shawn8888 commented 1 month ago

@pradhyumna85

C:\Backup\Projects\python>python hello_zerox2.py

Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox2.py", line 48, in <module>
    result = asyncio.run(main())
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox2.py", line 40, in main
    result = await zerox(file_path = file_path, model = model, output_file_path = output_file_path,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 180, in zerox
    await f.write(page_content)
  File "C:\Python312\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'gbk' codec can't encode character '\xb2' in position 1751: illegal multibyte sequence

This "gbk codec" error seems to be a language issue. My Windows is set to use Chinese for non-Unicode programs. Any solutions? Thanks!
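
The failure can be reproduced without zerox. The sketch below encodes the character from the traceback with the gbk codec, then writes the same text with an explicit UTF-8 encoding. It assumes the code doing the write controls the `open()` call, which is not the case for a library's internal write; the file name is illustrative.

```python
import os
import tempfile

text = "5 m\u00b2"  # contains '\xb2', the character named in the traceback

# The gbk codec has no mapping for this character, so encoding fails.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# An explicit encoding makes the write independent of the system code page.
path = os.path.join(tempfile.mkdtemp(), "output.md")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(gbk_ok, round_trip == text)
```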

shawn8888 commented 1 month ago

pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control

I was going to test the installation above on a different PC and got this error:


     Error during installation: Please install Poppler manually from https://poppler.freedesktop.org/
      Pre-install script failed: Command '['C:\\Python312\\python.exe', '-m', 'py_zerox.scripts.pre_install']' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for py-zerox
Failed to build py-zerox
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (py-zerox)

pradhyumna85 commented 1 month ago

@shawn8888, install the Poppler utils manually using prebuilt binaries, since you are on Windows, and then try pip install again. Steps to install the prebuilt Poppler binaries on Windows:

  1. Download the latest prebuilt binary zip from https://github.com/oschwartz10612/poppler-windows/releases
  2. Unzip it to some directory and add the Library/bin folder of the extracted directory to the PATH variable.
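
After editing PATH, a quick check from Python can confirm that the Poppler tools are visible to new processes. This is a minimal sketch: `pdftoppm` is one of the Poppler command-line utilities, but whether the pre-install script looks for that exact tool is an assumption, and `find_tool` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_tool(name: str = "pdftoppm") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool()
    print(location or "pdftoppm not found on PATH; check the Library/bin entry")
```

Run it in a newly opened terminal, since a terminal that was already open keeps the old PATH.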

shawn8888 commented 1 month ago

It works! Could you please also check the "gbk codec" error above? Maybe change the output.md file encoding to be UTF-8? Thanks!

pradhyumna85 commented 1 month ago

Set an environment variable (not inside Python) named PYTHONIOENCODING with the value utf-8 and see if that solves the issue.
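
To see which encodings the interpreter actually picked up, a short report like the sketch below can help. PYTHONIOENCODING applies to the standard streams, while the default for `open()` comes from the locale unless Python's UTF-8 mode is enabled. `encoding_report` is a hypothetical helper name.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses for console output and for text files."""
    return {
        "stdout": sys.stdout.encoding,  # influenced by PYTHONIOENCODING
        "open_default": locale.getpreferredencoding(False),  # default used by open()
        "utf8_mode": bool(sys.flags.utf8_mode),  # True when UTF-8 mode is on
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```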

shawn8888 commented 1 month ago

@pradhyumna85 You are the best! After setting PYTHONIOENCODING=utf-8 in CMD, the program works!

I have a couple of questions:

  1. How can I make this a default setting so I don't have to type it every time I run the script?
  2. When the PDF file contains Chinese characters, I encounter an error, even though I’ve tested that gpt-4o-mini does support OCR for Chinese. The error message is: UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f914' in position 2: illegal multibyte sequence.

Here is the pdf I tested: PasnewB.PDF

pradhyumna85 commented 1 month ago

For 1, you can set it at the OS level, e.g. as a Windows environment variable.

For 2, I am not sure either; if you find a solution, please share it here. Edit: set an environment variable (not inside Python) named PYTHONUTF8 with the value 1 and see if that solves the issue.
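
As an alternative to a system-wide setting, a small launcher can set PYTHONUTF8 only for the child process that runs the script. This is a sketch under assumptions: `run_with_utf8` is a hypothetical helper, and the script name in the usage example is the one from the traceback above.

```python
import os
import subprocess
import sys


def run_with_utf8(args: list) -> subprocess.CompletedProcess:
    """Run the current Python interpreter with UTF-8 mode enabled for the child only."""
    env = dict(os.environ, PYTHONUTF8="1")  # copy of the environment plus the flag
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        encoding="utf-8",  # decode the child's output as UTF-8
    )


if __name__ == "__main__":
    # Confirm that the child process really runs in UTF-8 mode
    check = run_with_utf8(["-c", "import sys; print(sys.flags.utf8_mode)"])
    print(check.stdout.strip())
```

Calling `run_with_utf8(["hello_zerox2.py"])` would run the script the same way, with the captured output available in the `stdout` and `stderr` attributes of the result.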

Also, I would suggest working on Linux; you would have a much easier life. If you are on Windows, I would recommend using WSL 2.