getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

GPT 4o looping infinitely on certain characters (`...` and `||||`) #38

Open tylermaran opened 2 months ago

tylermaran commented 2 months ago

It looks like some of the more confusing documents trigger an infinite-looping behavior where Zerox outputs `...` indefinitely. The easiest way to reproduce this is with the tax Form 1040; page 2 of that document hits this error pretty consistently.

Zerox config

```js
const result = await zerox({
  filePath,
  model: ModelOptions.gpt_4o,
  openaiAPIKey: process.env.OPENAI_API_KEY,
});
```

Example document:

Test 1040.pdf

Input:

image

Zerox Output:

image
pradhyumna85 commented 2 months ago

@tylermaran, maybe try tweaking the system prompt to see if that helps.

tylermaran commented 2 months ago

Tried a lot of prompt-tweaking variants here. @xdotli added PR #22 just to add a max-tokens cap, because the model would spend 15 minutes infinitely printing `....` before we set a token limit.

Claude 3.5 didn't seem to have the same problem. So once we have a Node.js equivalent of the litellm work, it'll be easy to switch to different models. But ideally I'd like to find a fix that still works with the OpenAI models.
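Until a model switch is possible, one pragmatic mitigation is to detect the degenerate tail (`...` or `||||` repeated) and retry or truncate. A minimal sketch, assuming nothing about Zerox's internals; the window size and threshold here are arbitrary choices, not values the library uses:

```python
def looks_degenerate(text: str, tail_window: int = 200, threshold: float = 0.5) -> bool:
    """Return True if the tail of `text` is dominated by a single character,
    which is the signature of the looping output described in this issue."""
    tail = text[-tail_window:]
    if len(tail) < 20:  # too short to judge
        return False
    most_common = max(set(tail), key=tail.count)
    return tail.count(most_common) / len(tail) >= threshold
```

A caller could run this on each page's transcription and re-request (or fall back to another model) when it fires, instead of waiting out a max-tokens cap.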

pradhyumna85 commented 2 months ago

@tylermaran, @xdotli, I am able to get output for page 2 WITHOUT specifying a max-tokens limit or altering the default system prompt, by passing the combination of additional hyperparameters below in one shot. Have a look at the following code (uses the unmerged PR #39):

```python
import asyncio
import os

from pyzerox import zerox

model = "azure/gpt-4o-mini"

os.environ["AZURE_API_KEY"] = ""  # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = ""  # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "2024-02-15-preview"  # "2023-05-15"

## extra args for the model
kwargs = {"frequency_penalty": 0.5, "top_p": 0.9, "presence_penalty": 0.5, "temperature": 0.5}

## system prompt to use for the vision model, None for default
custom_system_prompt = None
file_path = "Test.1040.pdf"  ## local file path and file URL supported

## process only some pages, or all
select_pages = [2]

output_dir = "./output_test"  ## directory to save the consolidated markdown file

# Define the main async entrypoint
async def main():
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
        custom_system_prompt=custom_system_prompt, select_pages=select_pages, **kwargs)
    return result

# Run the main function
result = asyncio.run(main())

consolidated_markdown = "\n\n".join([page.content for page in result.pages])

print(consolidated_markdown)  ## only one page
```

Output (rendered as md):


Form 1040 (2023) Page 2

Tax and Credits 16 Tax (see instructions). Check if any Form(s): 1 8814 2 4972 3 ... 16 17 Amount from Schedule 2, line 3 ........................................... 17 18 Add lines 16 and 17 .......................................................... 18 19 Child tax credit or credit for other dependents from Schedule 8812 .. 19 20 Amount from Schedule 3, line 8 ............................................. 20 21 Add lines 19 and 20 ................................................................... 21 22 Subtract line 21 from line 18. If zero or less, enter -0- .................22 23 Other taxes, including self-employment tax, from Schedule 2, line 21 .23 24 Add lines 22 and 23. This is your total tax ..................................24

Payments 25 Federal income tax withheld from: a Form(s) W-2 ......................................................................25a b Form(s)1099 .....................................................................25b c Other forms (see instructions) ................................................25c 26 Add lines 25a through 25c .......................................................25d 272023 estimated tax payments and amount applied from prior year return .26 28 Earned income credit (EIC) .......................................................27 29 Additional child tax credit from Schedule8812 .........................28 30 American opportunity credit from Form8863, line8 ....................29 31 Amount from Schedule J, line15 ................................................30 32 Add lines27,28,29,and31. These are your total other payments and refundable credits .32 33 Add lines25d,26,and32. These are your total payments ..............33

Refund 34 If line33 is more than line24, subtract line24 from line33. This is the amount you overpaid ....34 35a Amount of line34 you want refunded to you. If Form8888 is attached, check here ...35a
b Routing number ...........38 | ... | ... | ... | Type: A Checking B Savings C Account number ..........29 | ... | ... | ... | 36 Amount on line34 you want applied to your2024 estimated tax ....36

Amount You Owe 37 Subtract line33 from line34. This is the amount you owe ....37 38 Estimated tax penalty (see instructions) .....................................38

Third Party Designee Do you want to allow another person to discuss this return with the IRS? See instructions ............ Yes. Complete below. No. Designee's name .................... Phone no. .................... Personal identification number (PIN)

Sign Here Under penalties of perjury, I declare that I have examined this return and accompanying schedules and statements, and to the best of my knowledge and belief, they are true, correct, and complete. Declaration of preparer (other than taxpayer) is based on all information of which preparer has any knowledge. Your signature ................. Date ................ Your occupation ....... CEO If the IRS sent you an Identity Protection PIN, enter it here (see inst.)

Joint return? Spouse’s signature. If a joint return, both must sign........ Date Spouse’s occupation ...... If the IRS sent your spouse an Identity Protection PIN, enter it here (see inst.) Phone no. ................. Email address .......

Paid Preparer Use Only Preparer’s name ................. Preparer’s signature ............... Date ...... PTIN ...... Firm’s name .................. Phone no. Firm's address ................................................................................................... Go to www.irs.gov/Form1040 for instructions and the latest information. Form1040(2023)


The output doesn't look the best, but at least it contains text. You may try a few variations of the hyperparameters to see how the behavior changes.

xdotli commented 2 months ago

@pradhyumna85 Hey, thanks for playing around with the hyperparameters! The temperature and top_p are meant to stay stable in a production environment, so we probably want to patch this by using a different model. Changing hyperparameters is a good direction, and I was playing with it too. The thing is, even with default params it's not a stably reproducible bug (it happens maybe 70% of the time).
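The "patch this by using a different model" idea can be sketched as a simple fallback loop: try models in order and keep the first output that doesn't loop. Here `call_model` and `is_looping` are hypothetical stand-ins for the real zerox/litellm calls and whatever loop detector you use; this is not an API the SDK currently exposes:

```python
def transcribe_with_fallback(page, models, call_model, is_looping):
    """Try each model in order; return (model, output) for the first
    non-looping transcription, or the last attempt if all of them loop."""
    last = ""
    for model in models:
        last = call_model(model, page)
        if not is_looping(last):
            return model, last
    # every model looped; return the final attempt so the caller can inspect it
    return models[-1], last
```

With an API base URL option in the JS SDK, the same logic could live behind a litellm proxy instead of in application code.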

pradhyumna85 commented 2 months ago

> Tried a lot of prompt tweaking variants here. @xdotli added the #22 pr just to add a max tokens because the model would spend 15 minutes just infinitely printing .... before we set a token cap.
>
> Seems like Claude 3.5 didn't have the same problems. So once we have a nodejs equivalent of the litellm work, it'll be easy to switch to different models. But ideally I'd like to find some fix that still works with the openai models.

@tylermaran, for the production use case, what I would suggest for now (until the Vercel AI SDK is integrated) is setting up a litellm proxy server (CLI or Docker setup) as a gateway to all your providers (Anthropic, OpenAI, etc.). In the zerox JS SDK we would just need to add an option to pass an API base URL, where the litellm proxy URL would go. You could then implement all kinds of fallback logic (e.g. on timeout) to switch between models seamlessly.

Should we create a branch for adding the API base parameter to the JS SDK? After that, we can describe in the README how to start a litellm proxy server from a YAML config, and the zerox JS SDK should be able to use it to access any configured provider. At least we can have this option until we transition to the Vercel AI SDK.
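For reference, a litellm proxy config along these lines could route a single alias to multiple providers; the model names, aliases, and env-var references below are placeholders (check the litellm docs for the current schema):

```yaml
model_list:
  - model_name: gpt-4o            # alias the SDK would call
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE
  - model_name: claude-3-5-sonnet # fallback provider behind the same proxy
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
```

Starting the proxy with `litellm --config config.yaml` and pointing the SDK's API base at the proxy's local URL would then let the proxy handle provider switching.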

Let me know your thoughts.