Closed michaelfeil closed 2 months ago
@michaelfeil, could you explain in bit more detail about why the issue was there with respect to the code and how exactly it was fixed? thanks
@pradhyumna85 The previous code writes to files to /tmp
. No additonal subfolders are used.
/tmp/file_page1.png, /tmp/file_page2.png, /tmp/file_page11.png,
Pretty bad Issues with that:
Thanks - low on time, appreciate the review!
@annapo23 Do you have time to review this code?
Hey @michaelfeil. Thanks for putting in this PR. We can review and merge today.
@pradhyumna85 this is a similar fix to one we pushed on the NPM side to create unique tmp/:id
directories for each document. Otherwise concurrent requests will store document images in the same folder and cause issues.
We also needed to add a leftPad
to pages to ensure order would remain consistent.
https://github.com/getomni-ai/zerox/blob/e5c683b0d65e3714e3e51f51d2c14cd489d562cd/node-zerox/src/utils.ts#L122
Otherwise we would end up with situations like:
page_1
page_11
page_2
page_3
...
@tylermaran agree on that, as usual "sorted" function in python without padding would output incorrectly ordered files by filename, something like this (on sample file paths):
['/tmp/some_dir/file_page_1.png',
'/tmp/some_dir/file_page_10.png',
'/tmp/some_dir/file_page_11.png',
'/tmp/some_dir/file_page_12.png',
'/tmp/some_dir/file_page_13.png',
'/tmp/some_dir/file_page_14.png',
'/tmp/some_dir/file_page_15.png',
'/tmp/some_dir/file_page_16.png',
'/tmp/some_dir/file_page_17.png',
'/tmp/some_dir/file_page_18.png',
'/tmp/some_dir/file_page_19.png',
'/tmp/some_dir/file_page_2.png',
'/tmp/some_dir/file_page_20.png',
'/tmp/some_dir/file_page_21.png',
'/tmp/some_dir/file_page_22.png',
'/tmp/some_dir/file_page_23.png',
'/tmp/some_dir/file_page_24.png',
'/tmp/some_dir/file_page_25.png',
'/tmp/some_dir/file_page_26.png',
'/tmp/some_dir/file_page_27.png',
'/tmp/some_dir/file_page_28.png',
'/tmp/some_dir/file_page_29.png',
'/tmp/some_dir/file_page_3.png',
'/tmp/some_dir/file_page_30.png',
'/tmp/some_dir/file_page_31.png',
'/tmp/some_dir/file_page_32.png',
'/tmp/some_dir/file_page_33.png',
'/tmp/some_dir/file_page_34.png',
'/tmp/some_dir/file_page_35.png',
'/tmp/some_dir/file_page_36.png',
'/tmp/some_dir/file_page_37.png',
'/tmp/some_dir/file_page_38.png',
'/tmp/some_dir/file_page_39.png',
...
]
but I think, instead of the padding approach, I would go with another simpler, cleaner and more general and agnostic approach (doesn't require any padding whatsoever, I am using it for years) of using the below custom sorting method:
import re
def sorted_nicely( l ):
""" Sort the given iterable in the way that humans expect -> alphanumerically"""
convert = lambda text: int(text) if text.isdigit() else text
alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
return sorted(l, key = alphanum_key)
which would give the expected result:
['/tmp/some_dir/file_page_1.png',
'/tmp/some_dir/file_page_2.png',
'/tmp/some_dir/file_page_3.png',
'/tmp/some_dir/file_page_4.png',
'/tmp/some_dir/file_page_5.png',
'/tmp/some_dir/file_page_6.png',
'/tmp/some_dir/file_page_7.png',
'/tmp/some_dir/file_page_8.png',
'/tmp/some_dir/file_page_9.png',
'/tmp/some_dir/file_page_10.png',
'/tmp/some_dir/file_page_11.png',
'/tmp/some_dir/file_page_12.png',
'/tmp/some_dir/file_page_13.png',
'/tmp/some_dir/file_page_14.png',
'/tmp/some_dir/file_page_15.png',
'/tmp/some_dir/file_page_16.png',
'/tmp/some_dir/file_page_17.png',
'/tmp/some_dir/file_page_18.png',
'/tmp/some_dir/file_page_19.png',
'/tmp/some_dir/file_page_20.png',
'/tmp/some_dir/file_page_21.png',
'/tmp/some_dir/file_page_22.png',
'/tmp/some_dir/file_page_23.png',
'/tmp/some_dir/file_page_24.png',
'/tmp/some_dir/file_page_25.png',
'/tmp/some_dir/file_page_26.png',
'/tmp/some_dir/file_page_27.png',
'/tmp/some_dir/file_page_28.png',
'/tmp/some_dir/file_page_29.png',
'/tmp/some_dir/file_page_30.png',
'/tmp/some_dir/file_page_31.png',
'/tmp/some_dir/file_page_32.png',
'/tmp/some_dir/file_page_33.png',
'/tmp/some_dir/file_page_34.png',
'/tmp/some_dir/file_page_35.png',
'/tmp/some_dir/file_page_36.png',
'/tmp/some_dir/file_page_37.png',
'/tmp/some_dir/file_page_38.png',
'/tmp/some_dir/file_page_39.png',
'/tmp/some_dir/file_page_40.png',
'/tmp/some_dir/file_page_41.png',
'/tmp/some_dir/file_page_42.png',
'/tmp/some_dir/file_page_43.png',
...
]
Also @michaelfeil, I see you removed the "temp_dir" param from the zerox api, I still believe strongly that it is good to have in case someone wants to give a custom path in case for debugging or analysis. What we can do is set it to "None" by default which would use the "default_tmp_dir/
@tylermaran, once this PR is also finalized and merged, I can update PR https://github.com/getomni-ai/zerox/pull/21 to incorporate these changes as well, let me know what you think.
I work on this on my work hours, fixed the issue and was nice enough to open source it. In case this PR is not useful, please close it or merge it. I can't put more work into such a simple PR.
@pradhyumna85 Disagree. The function I implemented is fixed with a 6 digit padding that i introduced.
It creates a new dir, which solves the issue with cleanup / concurrent request. For debugging, you could use a Python debugger, breakpoint and analyze your tmp dir.
I work on this on my work hours, fixed the issue and was nice enough to open source it. In case this PR is not useful, please close it or merge it. I can't put more work into such a simple PR.
@pradhyumna85 Disagree. The function I implemented is fixed with a 6 digit padding that i introduced.
It creates a new dir, which solves the issue with cleanup / concurrent request. For debugging, you could use a Python debugger, breakpoint and analyze your tmp dir.
@michaelfeil My bad sorry, I missed the padding part with the usual sorting. Thanks for your work on this! I didn’t meant to offend you in any way with my suggestion—it was just an alternative idea which doesn't require padding altogether -> shorter filenames. Your solution works great, and I appreciate the effort you’ve put into it. For tmp dir it was just something preferential from my side but your solution also works.
Closes #25