getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.74k stars 367 forks source link

Fixes the correct order processing in python #26

Closed michaelfeil closed 2 months ago

michaelfeil commented 2 months ago

Closes #25

pradhyumna85 commented 2 months ago

@michaelfeil, could you explain in bit more detail about why the issue was there with respect to the code and how exactly it was fixed? thanks

michaelfeil commented 2 months ago

@pradhyumna85 The previous code writes to files to /tmp. No additonal subfolders are used.

/tmp/file_page1.png, /tmp/file_page2.png, /tmp/file_page11.png,

Pretty bad Issues with that:

Thanks - low on time, appreciate the review!

michaelfeil commented 2 months ago

@annapo23 Do you have time to review this code?

tylermaran commented 2 months ago

Hey @michaelfeil. Thanks for putting in this PR. We can review and merge today.

@pradhyumna85 this is a similar fix to one we pushed on the NPM side to create unique tmp/:id directories for each document. Otherwise concurrent requests will store document images in the same folder and cause issues.

Ex: https://github.com/getomni-ai/zerox/blob/e5c683b0d65e3714e3e51f51d2c14cd489d562cd/node-zerox/src/index.ts#L40

We also needed to add a leftPad to pages to ensure order would remain consistent. https://github.com/getomni-ai/zerox/blob/e5c683b0d65e3714e3e51f51d2c14cd489d562cd/node-zerox/src/utils.ts#L122

Otherwise we would end up with situations like:

page_1
page_11
page_2
page_3
...
pradhyumna85 commented 2 months ago

@tylermaran agree on that, as usual "sorted" function in python without padding would output incorrectly ordered files by filename, something like this (on sample file paths):

['/tmp/some_dir/file_page_1.png',
 '/tmp/some_dir/file_page_10.png',
 '/tmp/some_dir/file_page_11.png',
 '/tmp/some_dir/file_page_12.png',
 '/tmp/some_dir/file_page_13.png',
 '/tmp/some_dir/file_page_14.png',
 '/tmp/some_dir/file_page_15.png',
 '/tmp/some_dir/file_page_16.png',
 '/tmp/some_dir/file_page_17.png',
 '/tmp/some_dir/file_page_18.png',
 '/tmp/some_dir/file_page_19.png',
 '/tmp/some_dir/file_page_2.png',
 '/tmp/some_dir/file_page_20.png',
 '/tmp/some_dir/file_page_21.png',
 '/tmp/some_dir/file_page_22.png',
 '/tmp/some_dir/file_page_23.png',
 '/tmp/some_dir/file_page_24.png',
 '/tmp/some_dir/file_page_25.png',
 '/tmp/some_dir/file_page_26.png',
 '/tmp/some_dir/file_page_27.png',
 '/tmp/some_dir/file_page_28.png',
 '/tmp/some_dir/file_page_29.png',
 '/tmp/some_dir/file_page_3.png',
 '/tmp/some_dir/file_page_30.png',
 '/tmp/some_dir/file_page_31.png',
 '/tmp/some_dir/file_page_32.png',
 '/tmp/some_dir/file_page_33.png',
 '/tmp/some_dir/file_page_34.png',
 '/tmp/some_dir/file_page_35.png',
 '/tmp/some_dir/file_page_36.png',
 '/tmp/some_dir/file_page_37.png',
 '/tmp/some_dir/file_page_38.png',
 '/tmp/some_dir/file_page_39.png',
 ...
 ]

but I think, instead of the padding approach, I would go with another simpler, cleaner and more general and agnostic approach (doesn't require any padding whatsoever, I am using it for years) of using the below custom sorting method:

import re
def sorted_nicely( l ): 
    """ Sort the given iterable in the way that humans expect -> alphanumerically""" 
    convert = lambda text: int(text) if text.isdigit() else text 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

which would give the expected result:

['/tmp/some_dir/file_page_1.png',
 '/tmp/some_dir/file_page_2.png',
 '/tmp/some_dir/file_page_3.png',
 '/tmp/some_dir/file_page_4.png',
 '/tmp/some_dir/file_page_5.png',
 '/tmp/some_dir/file_page_6.png',
 '/tmp/some_dir/file_page_7.png',
 '/tmp/some_dir/file_page_8.png',
 '/tmp/some_dir/file_page_9.png',
 '/tmp/some_dir/file_page_10.png',
 '/tmp/some_dir/file_page_11.png',
 '/tmp/some_dir/file_page_12.png',
 '/tmp/some_dir/file_page_13.png',
 '/tmp/some_dir/file_page_14.png',
 '/tmp/some_dir/file_page_15.png',
 '/tmp/some_dir/file_page_16.png',
 '/tmp/some_dir/file_page_17.png',
 '/tmp/some_dir/file_page_18.png',
 '/tmp/some_dir/file_page_19.png',
 '/tmp/some_dir/file_page_20.png',
 '/tmp/some_dir/file_page_21.png',
 '/tmp/some_dir/file_page_22.png',
 '/tmp/some_dir/file_page_23.png',
 '/tmp/some_dir/file_page_24.png',
 '/tmp/some_dir/file_page_25.png',
 '/tmp/some_dir/file_page_26.png',
 '/tmp/some_dir/file_page_27.png',
 '/tmp/some_dir/file_page_28.png',
 '/tmp/some_dir/file_page_29.png',
 '/tmp/some_dir/file_page_30.png',
 '/tmp/some_dir/file_page_31.png',
 '/tmp/some_dir/file_page_32.png',
 '/tmp/some_dir/file_page_33.png',
 '/tmp/some_dir/file_page_34.png',
 '/tmp/some_dir/file_page_35.png',
 '/tmp/some_dir/file_page_36.png',
 '/tmp/some_dir/file_page_37.png',
 '/tmp/some_dir/file_page_38.png',
 '/tmp/some_dir/file_page_39.png',
 '/tmp/some_dir/file_page_40.png',
 '/tmp/some_dir/file_page_41.png',
 '/tmp/some_dir/file_page_42.png',
 '/tmp/some_dir/file_page_43.png',
 ...
 ]

Also @michaelfeil, I see you removed the "temp_dir" param from the zerox api, I still believe strongly that it is good to have in case someone wants to give a custom path in case for debugging or analysis. What we can do is set it to "None" by default which would use the "default_tmp_dir/" inside the function, the creation of again we handle in the function and if any custom location is given (non "None") then we create the directory if it doesn't exists and use it.

@tylermaran, once this PR is also finalized and merged, I can update PR https://github.com/getomni-ai/zerox/pull/21 to incorporate these changes as well, let me know what you think.

michaelfeil commented 2 months ago

I work on this on my work hours, fixed the issue and was nice enough to open source it. In case this PR is not useful, please close it or merge it. I can't put more work into such a simple PR.

@pradhyumna85 Disagree. The function I implemented is fixed with a 6 digit padding that i introduced.

It creates a new dir, which solves the issue with cleanup / concurrent request. For debugging, you could use a Python debugger, breakpoint and analyze your tmp dir.

pradhyumna85 commented 2 months ago

I work on this on my work hours, fixed the issue and was nice enough to open source it. In case this PR is not useful, please close it or merge it. I can't put more work into such a simple PR.

@pradhyumna85 Disagree. The function I implemented is fixed with a 6 digit padding that i introduced.

It creates a new dir, which solves the issue with cleanup / concurrent request. For debugging, you could use a Python debugger, breakpoint and analyze your tmp dir.

@michaelfeil My bad sorry, I missed the padding part with the usual sorting. Thanks for your work on this! I didn’t meant to offend you in any way with my suggestion—it was just an alternative idea which doesn't require padding altogether -> shorter filenames. Your solution works great, and I appreciate the effort you’ve put into it. For tmp dir it was just something preferential from my side but your solution also works.