Closed eaglemc closed 3 years ago
Can you recreate the situation where this issue happens? Has it happened often to you? Just once ever?
Well, I've only converted two documents. So while it happened in 50% of the cases, that's not statistically meaningful. However, now that I check, it is actually happening with the test1.pdf document from the repository, so there may be something machine-specific or just bad luck in the mix. I'm using Linux Mint 18.3 in a VMware virtual machine running on Windows 10...
Looking at the set -x output:
+ file_list_path=pdf2searchablepdf_temp_20200924-073426.461439419/file_list.txt
+ sort -V
+ find pdf2searchablepdf_temp_20200924-073426.461439419/file_list.txt pdf2searchablepdf_temp_20200924-073426.461439419/pg-1.tif pdf2searchablepdf_temp_20200924-073426.461439419/pg-2.tif pdf2searchablepdf_temp_20200924-073426.461439419/pg-3.tif
+ echo 'Running tesseract OCR on all generated TIF images in the temporary working directory.'
Running tesseract OCR on all generated TIF images in the temporary working directory.
It looks to me like the sort process is started first so the output can be piped to it, and the redirection means that the file_list.txt file must be created at that time. And then when find runs it would make sense that it finds the file_list.txt file. Except that makes it seem like it should have never worked, and clearly it did. But find isn't finding the file, bash is when it globs the asterisk, so maybe there's some difference in how that's behaving for me versus how it may have behaved in the past?
My bash version is GNU bash, version 4.3.48(1)-release (i686-pc-linux-gnu)
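The race suspected above can be probed with a small loop. This is only a sketch under assumptions: the temporary directory and the pg-*.tif names are made up for the test, and the count of 200 runs is arbitrary. As far as I understand bash, each side of a pipeline runs in its own subshell, so the glob on the find side and the > redirection on the sort side are not ordered relative to each other. The result may well be 0 on one machine and nonzero on another.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction: all paths and file names here are invented.
tmp_dir="$(mktemp -d)"
touch "$tmp_dir/pg-1.tif" "$tmp_dir/pg-2.tif" "$tmp_dir/pg-10.tif"
file_list_path="$tmp_dir/file_list.txt"

hits=0
for _ in $(seq 1 200); do
    rm -f "$file_list_path"
    # Same shape as the script's original command: bash expands the *
    # while the other side of the pipe may already have created the file.
    find "$tmp_dir"/* | sort -V > "$file_list_path"
    if grep -qF 'file_list.txt' "$file_list_path"; then
        hits=$((hits + 1))
    fi
done

echo "file_list.txt listed itself in $hits of 200 runs"
rm -rf "$tmp_dir"
```

A nonzero count would support the timing explanation; a zero count does not rule it out, since it only means the glob happened to win every time on that machine.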
@eaglemc, thanks for the details. Here's my bash version on Ubuntu 20.04:
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
That could be the difference. I also wonder if your VM is a difference. And, I wonder if my super-fast SSD is the difference. Maybe my rm -f "$file_list_path" cmd, called just before find "$dir_of_imgs"/* | sort -V > "$file_list_path", is completely finished prior to the find cmd running, since my SSD is so fast, but your virtual machine is executing those in parallel, and rm hasn't completed for you when find starts. I don't have a perfect understanding of bash parallelism and when things are in series vs in parallel.
In either case, you've pointed out a problem, and some potential solutions, and there is an open PR on it, so one way or another I'll accept and/or make some changes to hopefully address this. I like your suggestions in your first post too. And, you just taught me about set -x, and set +x to turn it off, which I'm adding in my notes in this file now too: https://github.com/ElectricRCAircraftGuy/eRCaGuy_dotfiles/blob/master/git%20%26%20Linux%20cmds%2C%20help%2C%20tips%20%26%20tricks%20-%20Gabriel.txt.
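On the series-versus-parallel question, here is a minimal sketch that separates the two cases. Commands on separate lines (or separated by ;) run in series, so bash waits for rm to exit before it starts the pipeline, and disk speed should not matter there. The pipeline itself is different: the > redirection on the sort side creates the file as soon as that side starts, even if nothing is ever written to it. The path below is a made-up temporary location, not the script's real one.

```shell
#!/usr/bin/env bash
# Hypothetical paths for illustration only.
work_dir="$(mktemp -d)"
file_list_path="$work_dir/file_list.txt"
touch "$file_list_path"

# Step 1: rm runs to completion before the next line starts.
rm -f "$file_list_path"
if [ -e "$file_list_path" ]; then after_rm="present"; else after_rm="gone"; fi

# Step 2: the redirection creates the file even though the left side
# of the pipe produces no output at all.
true | sort > "$file_list_path"
if [ -e "$file_list_path" ]; then after_pipe="present"; else after_pipe="gone"; fi

echo "after rm: $after_rm, after pipeline: $after_pipe"
rm -rf "$work_dir"
```

So the file that shows up in the listing is more likely the new one created by the redirection than a leftover that rm had not yet deleted.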
After some research, this seems to work the best, where test_imgs is a test directory:
find test_imgs -maxdepth 1 -mindepth 1 -not -type d -not -name '*.txt' | sort --version-sort
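To see what that command returns, here is a small sketch run against a throwaway directory. The file names (pg-1.tif, pg-10.tif, and so on) and the subdir directory are assumptions made up for the test; only the find and sort invocation is taken from the line above.

```shell
#!/usr/bin/env bash
# Throwaway test layout: three page images, one list file, one subdirectory.
work_dir="$(mktemp -d)"
mkdir -p "$work_dir/test_imgs/subdir"
touch "$work_dir/test_imgs/pg-1.tif" \
      "$work_dir/test_imgs/pg-2.tif" \
      "$work_dir/test_imgs/pg-10.tif" \
      "$work_dir/test_imgs/file_list.txt"

# Run from inside work_dir so the paths print relative, as in the thread.
listing="$(cd "$work_dir" && find test_imgs -maxdepth 1 -mindepth 1 \
    -not -type d -not -name '*.txt' | sort --version-sort)"

printf '%s\n' "$listing"
rm -rf "$work_dir"
```

Since find does the filtering and no shell glob is involved, file_list.txt and the subdirectory are both left out, and version sort puts pg-10.tif after pg-2.tif rather than after pg-1.tif.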
Fixed by #10 and commit 9b2855a.
@eaglemc, please let me know if the issue is resolved. If not, please reopen this issue. Thanks!
I'm not really sure how this happens, but I encountered an issue where the file_list.txt file includes a line for file_list.txt itself. Tesseract then chokes on this as an unrecognized file format.
To resolve this I changed the line
find "$temp_dir"/* | sort -V > "$file_list_path"
to
find "$temp_dir" -name *.tif | sort -V > "$file_list_path"
I did notice using set -x that with the original command it seems that bash is processing the * and passing a giant list of names to find, while in the modified command find is handling the wildcard.
It also might be better to use something like
find "$temp_dir" -not -name *.txt | sort -V > "$file_list_path"
in case the script were ever modified to use some other image file format in the future.
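One caveat with the commands above, sketched here under made-up file names: an unquoted *.tif or *.txt is only passed through to find if bash finds nothing to match it in the current directory. If a matching file happens to sit there, bash expands the pattern first and find receives a literal file name instead. Quoting the pattern avoids that. The stray.tif file and the temp directory below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical layout: one page image in temp/, one unrelated .tif in the cwd.
work_dir="$(mktemp -d)"
temp_dir="$work_dir/temp"
mkdir -p "$temp_dir"
touch "$temp_dir/pg-1.tif" "$work_dir/stray.tif"

# Unquoted: bash expands *.tif to stray.tif before find ever runs,
# so find looks for a file named stray.tif and matches nothing.
unquoted="$(cd "$work_dir" && find "$temp_dir" -name *.tif)"

# Quoted: find receives the pattern itself and does the matching.
quoted="$(cd "$work_dir" && find "$temp_dir" -name '*.tif')"

echo "unquoted: '$unquoted'"
echo "quoted:   '$quoted'"
rm -rf "$work_dir"
```

With set -x the difference shows up directly in the trace: the unquoted form prints the expanded file name, the quoted form prints the pattern.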