ElectricRCAircraftGuy / PDF2SearchablePDF

`pdf2searchablepdf input.pdf` = voila! "input_searchable.pdf" is created & now has searchable text!
MIT License

file_list.txt included in file_list.txt #9

Closed eaglemc closed 3 years ago

eaglemc commented 4 years ago

I'm not really sure how this happens, but I encountered an issue where the file_list.txt file includes a line for file_list.txt itself. Tesseract then chokes on this as an unrecognized file format.

To resolve this I changed the line `find "$temp_dir"/* | sort -V > "$file_list_path"` to `find "$temp_dir" -name *.tif | sort -V > "$file_list_path"`

I did notice using set -x that with the original command it seems that bash is processing the * and passing a giant list of names to find, while in the modified command find is handling the wildcard.

It also might be better to use something like `find "$temp_dir" -not -name *.txt | sort -V > "$file_list_path"` in case the script were ever modified to use some other image file format in the future.
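One caveat about the patterns in the commands above: an unquoted `*.tif` or `*.txt` is still expanded by the shell whenever the current directory happens to contain a matching file, so find only sees the pattern when it is quoted. A minimal sketch of the difference, assuming two scratch directories created with mktemp and made-up file names (none of this is taken from the script itself):

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directories: one stands in for "$temp_dir",
# the other is the directory the command is run from.
temp_dir="$(mktemp -d)"
cwd_dir="$(mktemp -d)"
trap 'rm -rf "$temp_dir" "$cwd_dir"' EXIT

touch "$temp_dir/pg-1.tif" "$temp_dir/pg-2.tif" "$temp_dir/file_list.txt"
touch "$cwd_dir/stray.tif"
cd "$cwd_dir"

# Unquoted: the shell rewrites *.tif to stray.tif before find starts,
# so find looks for files literally named stray.tif under "$temp_dir".
unquoted_count="$(find "$temp_dir" -name *.tif | wc -l)"

# Quoted: the pattern reaches find unchanged and find does the matching.
quoted_count="$(find "$temp_dir" -name '*.tif' | wc -l)"

echo "unquoted matches: $unquoted_count, quoted matches: $quoted_count"
```

In this setup the unquoted form finds nothing, while the quoted form finds both page images. With two or more matching files in the current directory the unquoted form would instead fail with a find usage error.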

ElectricRCAircraftGuy commented 4 years ago

Can you recreate the situation where this issue happens? Has it happened often to you? Just once ever?

eaglemc commented 4 years ago

Well, I've only converted two documents. So while it happened in 50% of the cases, that's not statistically meaningful. However, now that I check, it is actually happening with the test1.pdf document from the repository, so there may be something machine-specific or just bad luck in the mix. I'm using Linux Mint 18.3 in a VMware virtual machine running on Windows 10...

Looking at the set -x output:

+ file_list_path=pdf2searchablepdf_temp_20200924-073426.461439419/file_list.txt
+ sort -V
+ find pdf2searchablepdf_temp_20200924-073426.461439419/file_list.txt pdf2searchablepdf_temp_20200924-073426.461439419/pg-1.tif pdf2searchablepdf_temp_20200924-073426.461439419/pg-2.tif pdf2searchablepdf_temp_20200924-073426.461439419/pg-3.tif
+ echo 'Running tesseract OCR on all generated TIF images in the temporary working directory.'
Running tesseract OCR on all generated TIF images in the temporary working directory.

It looks to me like the sort process is started first so the output can be piped to it, and the redirection means that the file_list.txt file must be created at that time. And then when find runs it would make sense that it finds the file_list.txt file. Except that makes it seem like it should have never worked, and clearly it did. But find isn't finding the file, bash is when it globs the asterisk, so maybe there's some difference in how that's behaving for me versus how it may have behaved in the past? My bash version is GNU bash, version 4.3.48(1)-release (i686-pc-linux-gnu)
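The trace above is consistent with a race rather than a fixed ordering: bash runs each side of a pipeline in its own process and starts them at roughly the same time, so the output redirection on the sort side may create the file before or after the glob on the find side is expanded. The sketch below only illustrates the first half of that, namely that the redirection target already exists while the left-hand command is still running. It assumes a scratch directory from mktemp and uses an artificial `sleep` as a stand-in for find; the timings are arbitrary.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directory; not the script's real temp dir.
work_dir="$(mktemp -d)"
trap 'rm -rf "$work_dir"' EXIT
out_file="$work_dir/file_list.txt"

# The left side is deliberately slow. The right side opens (and so creates)
# its output file as soon as its own process starts.
{ sleep 2; echo "pg-1.tif"; } | sort -V > "$out_file" &
pipeline_pid=$!

sleep 0.5   # the left side is still sleeping at this point
if [ -e "$out_file" ]; then
    created_early="yes"
else
    created_early="no"
fi
echo "output file exists before the left side finished: $created_early"

# Let the pipeline finish, then read what sort wrote.
wait "$pipeline_pid"
final_contents="$(cat "$out_file")"
echo "final contents: $final_contents"
```

Because the file can appear this early, a glob such as `"$temp_dir"/*` on the left side may or may not pick it up, depending on which process gets there first. That would fit the behavior differing between machines.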

ElectricRCAircraftGuy commented 3 years ago

@eaglemc , thanks for the details. Here's my bash version on Ubuntu 20.04:

$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

That could be the difference. I also wonder if your VM is a difference. And, I wonder if my super-fast SSD is the difference. Maybe my `rm -f "$file_list_path"` cmd, called just before `find "$dir_of_imgs"/* | sort -V > "$file_list_path"`, is completely finished prior to the find cmd running, since my SSD is so fast, but your Virtual Machine is executing those in parallel, and rm hasn't completed for you when find starts. I don't have a perfect understanding of bash parallelism and when things are in series vs in parallel.

In either case, you've pointed out a problem, and some potential solutions, and there is an open PR on it, so one way or another I'll accept and/or make some changes to hopefully address this. I like your suggestions in your first post too. And, you just taught me about set -x, and set +x to turn it off, which I'm adding in my notes in this file now too: https://github.com/ElectricRCAircraftGuy/eRCaGuy_dotfiles/blob/master/git%20%26%20Linux%20cmds%2C%20help%2C%20tips%20%26%20tricks%20-%20Gabriel.txt.

ElectricRCAircraftGuy commented 3 years ago

After some research, this seems to work the best, where test_imgs is a test directory:

find test_imgs -maxdepth 1 -mindepth 1 -not -type d -not -name '*.txt' | sort --version-sort
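To try the command above against the kind of layout described in this issue, here is a small sketch. The directory is created with mktemp rather than being the real test_imgs, and its contents (including the subdirectory) are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the test_imgs directory.
test_imgs="$(mktemp -d)"
trap 'rm -rf "$test_imgs"' EXIT

mkdir "$test_imgs/subdir"
touch "$test_imgs/pg-10.tif" "$test_imgs/pg-2.tif" "$test_imgs/pg-1.tif" \
      "$test_imgs/file_list.txt" "$test_imgs/subdir/nested.tif"

file_list_path="$test_imgs/file_list.txt"

# -mindepth 1 skips the directory itself, -maxdepth 1 stays out of subdir,
# -not -type d drops directories, and the quoted '*.txt' pattern excludes
# file_list.txt whether or not the redirection has already created it.
find "$test_imgs" -maxdepth 1 -mindepth 1 -not -type d -not -name '*.txt' \
    | sort --version-sort > "$file_list_path"

cat "$file_list_path"
```

The version sort puts pg-2.tif ahead of pg-10.tif, which a plain lexical sort would not do.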

References:

  1. https://alvinalexander.com/linux-unix/linux-find-files-not-matching-filename-pattern-dont-match/

ElectricRCAircraftGuy commented 3 years ago

Fixed by #10 and commit 9b2855a.

@eaglemc , please let me know if the issue is resolved. If not, please reopen this issue. Thanks!