Open shawn8888 opened 1 month ago
Found a solution here: https://github.com/getomni-ai/zerox/pull/41
pip uninstall py-zerox
pip install git+https://github.com/getomni-ai/zerox.git
created a .py file:
import os
from pyzerox import zerox
import asyncio
async def main():
    # Set your OpenAI API key
    os.environ["OPENAI_API_KEY"] = "mykey"
    # Path to the PDF file you want to process
    file_path = "PasnewB.PDF"
    # Call the zerox function
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")
    # Print the Markdown result
    print(result)

# Run the main function
asyncio.run(main())
The ModuleNotFoundError is fixed. However, I now get a different error:
C:\Backup\Projects\python>python hello_zerox.py
Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox.py", line 19, in <module>
    asyncio.run(main())
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox.py", line 13, in main
    result = await zerox(file_path=file_path, model="gpt-4o-mini", output_dir="./output")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 91, in zerox
    select_pages = sorted(select_pages)
                   ^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not iterable
Please help! Thanks!
@shawn8888, the fix for the second problem has already been raised as PR #40, which is still unmerged. You can use it for now by uninstalling your py-zerox package and reinstalling with:
pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control
@tylermaran, @annapo23, could you please review and merge PR #40?
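For anyone wondering why the TypeError appears: the traceback shows sorted() being called on select_pages while it is still None. The snippet below is only a minimal sketch of the usual guard for an optional argument like this. It is not the code from PR #40, and normalize_select_pages is a hypothetical helper name used purely for illustration.

```python
from typing import Iterable, List, Optional


def normalize_select_pages(select_pages: Optional[Iterable[int]]) -> Optional[List[int]]:
    """Return a sorted, de-duplicated page list, or None when no pages were selected.

    Hypothetical helper: it only illustrates guarding against None before sorting.
    """
    if select_pages is None:
        # sorted(None) raises "TypeError: 'NoneType' object is not iterable",
        # so the "no selection" case is passed through untouched.
        return None
    return sorted(set(select_pages))


print(normalize_select_pages(None))
print(normalize_select_pages([3, 1, 3]))
```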
@pradhyumna85 Thank you for your reply! I have uninstalled 0.0.5 and reinstalled 0.0.6. However, I got another error. I use the OpenAI API and the key looks fine to me.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.
ERROR:root:Failed to process image Error:
Error in Completion Response. Error: litellm.BadRequestError: OpenAIException - Error code: 400 - {'error': {'message': 'Unrecognized request argument supplied: output_dir', 'type': 'invalid_request_error', 'param': None, 'code': None}}
Please check the status of your model provider API status.
ZeroxOutput(completion_time=2388.78, file_name='cs101', input_tokens=0, output_tokens=0, pages=[Page(content='', content_length=0, page=1)])
@shawn8888, the parameter output_dir has been replaced with output_file_path, which is the path of the output .md file rather than a directory. See: https://github.com/pradhyumna85/zerox/tree/formatting-control?tab=readme-ov-file#usage-1
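A rough sketch of what the adjusted call could look like on the formatting-control branch, assuming it accepts file_path, model and output_file_path as described above. build_zerox_kwargs and convert are hypothetical helper names, and the choice of an "output" folder with the PDF's base name is my own assumption, not something the library requires.

```python
import os


def build_zerox_kwargs(pdf_path, out_dir="output", model="gpt-4o-mini"):
    """Build keyword arguments that use output_file_path instead of output_dir."""
    # Derive "<out_dir>/<pdf name>.md" from the input path (assumed naming scheme).
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return {
        "file_path": pdf_path,
        "model": model,
        "output_file_path": os.path.join(out_dir, stem + ".md"),
    }


async def convert(pdf_path):
    # Imported lazily so the helper above works even without py-zerox installed.
    from pyzerox import zerox

    return await zerox(**build_zerox_kwargs(pdf_path))


# To run it: import asyncio; print(asyncio.run(convert("PasnewB.PDF")))
```

Since output_dir is no longer passed, it should not end up being forwarded to the model provider as an unrecognized request argument.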
@pradhyumna85
C:\Backup\Projects\python>python hello_zerox2.py
Traceback (most recent call last):
  File "C:\Backup\Projects\python\hello_zerox2.py", line 48, in <module>
    result = asyncio.run(main())
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\asyncio\base_events.py", line 685, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Backup\Projects\python\hello_zerox2.py", line 40, in main
    result = await zerox(file_path = file_path, model = model, output_file_path = output_file_path,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\pyzerox\core\zerox.py", line 180, in zerox
    await f.write(page_content)
  File "C:\Python312\Lib\site-packages\aiofiles\threadpool\utils.py", line 43, in method
    return await self._loop.run_in_executor(self._executor, cb)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'gbk' codec can't encode character '\xb2' in position 1751: illegal multibyte sequence
This "gbk codec" error seems to be caused by my language settings. My Windows is set to use Chinese for non-Unicode programs. Any solutions? Thanks!
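To illustrate what is going on: the character '\xb2' (superscript two) from the traceback has no representation in the gbk codec, so a file opened with the system default encoding on this machine cannot store it. The sketch below reproduces that with the standard library only and shows that passing encoding="utf-8" explicitly avoids the problem. It uses the built-in open rather than the aiofiles call that pyzerox makes, and can_encode and write_markdown are hypothetical helper names.

```python
import os
import tempfile


def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def write_markdown(path, content):
    """Write content as UTF-8 regardless of the system's default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)


sample = "Area: 10 m\xb2"  # contains the character from the traceback

print("gbk can encode it:", can_encode(sample, "gbk"))
print("utf-8 can encode it:", can_encode(sample, "utf-8"))

# Round-trip through a temporary file using an explicit UTF-8 encoding.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output.md")
    write_markdown(target, sample)
    with open(target, encoding="utf-8") as f:
        print("round trip ok:", f.read() == sample)
```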
pip install git+https://github.com/pradhyumna85/zerox.git@formatting-control
I was going to test the installation above on a different PC and got this error:
Error during installation: Please install Poppler manually from https://poppler.freedesktop.org/
Pre-install script failed: Command '['C:\\Python312\\python.exe', '-m', 'py_zerox.scripts.pre_install']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for py-zerox
Failed to build py-zerox
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (py-zerox)
@shawn8888, since you are on Windows, install the Poppler utilities manually from prebuilt binaries and then try pip install again. Steps to install the prebuilt Poppler binaries on Windows:
1. Download the latest prebuilt binary zip from https://github.com/oschwartz10612/poppler-windows/releases
2. Unzip it to a directory of your choice and add the extracted **Library/bin** folder to the [PATH variable](https://stackoverflow.com/questions/44272416/how-to-add-a-folder-to-path-environment-variable-in-windows-10-with-screensho).
It works! Could you please also check the "gbk codec" error above? Maybe change the output.md file encoding to be UTF-8? Thanks!
Set an environment variable (outside Python, not from within the script) named PYTHONIOENCODING with the value utf-8 and see if that solves the issue.
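If it helps with debugging, here is a small sketch for checking which encodings are actually in effect before and after setting the variable. As far as I know, PYTHONIOENCODING governs stdin/stdout/stderr, while files opened without an explicit encoding follow the locale's preferred encoding, so the two values can differ. encoding_report is a hypothetical helper name.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is currently using."""
    return {
        # Encoding of the standard output stream (affected by PYTHONIOENCODING).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for open() without an explicit encoding argument.
        "preferred": locale.getpreferredencoding(False),
        # 1 when Python's UTF-8 mode is enabled (for example via PYTHONUTF8=1).
        "utf8_mode": sys.flags.utf8_mode,
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```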
@pradhyumna85 You are the best! After setting PYTHONIOENCODING=utf-8 in CMD, the program works!
I have a couple of questions:
- How can I make this a default setting so I don't have to type it every time I run the script?
- When the PDF file contains Chinese characters, I encounter an error, even though I’ve tested that gpt-4o-mini does support OCR for Chinese. The error message is:
UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f914' in position 2: illegal multibyte sequence.
Here is the pdf I tested: PasnewB.PDF
For 1, you can set it at the OS level, e.g. on Windows:
For 2, I am not sure either; if you find a solution, please share it here. Edit: Set an environment variable (outside Python, not from within the script) named PYTHONUTF8 with the value 1 and see if that solves the issue.
Also, I would suggest working on Linux, which would make your life much easier. If you need to stay on Windows, I would recommend using WSL 2.
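If changing the OS-level settings is not an option, one alternative (a sketch, not something from the zerox docs) is a tiny launcher that starts the real script in a child process with PYTHONUTF8 and PYTHONIOENCODING already set, so nothing has to be typed each time. The variables have to be in place when the interpreter starts, which is why the launcher spawns a new Python process instead of setting os.environ inside the script itself. run_in_utf8_mode is a hypothetical helper name, and PYTHONUTF8 needs Python 3.7 or newer.

```python
import os
import subprocess
import sys


def run_in_utf8_mode(args):
    """Run the current Python interpreter with UTF-8 mode forced on."""
    # Copy the current environment and add the two variables discussed above.
    env = dict(os.environ, PYTHONUTF8="1", PYTHONIOENCODING="utf-8")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )


# Quick self-check: the child process should report that UTF-8 mode is enabled.
check = run_in_utf8_mode(["-c", "import sys; print(sys.flags.utf8_mode)"])
print("child utf8_mode:", check.stdout.strip())
```

For the script from this thread, the call would be run_in_utf8_mode(["hello_zerox2.py"]), followed by printing the returned stdout and stderr.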
I'm encountering a ModuleNotFoundError when trying to import the py-zerox module in my Python project, despite having installed it successfully.
Environment
Steps to Reproduce:
Install the py-zerox package using:
Attempt to import the module in a Python script:
Receive the following error:
Could you please assist me in resolving this issue? Any guidance on ensuring that the py-zerox module is recognized would be greatly appreciated.
Thank you!