DanBloomberg / leptonica

Leptonica is an open source library containing software that is broadly useful for image processing and image analysis applications. The official github repository for Leptonica is: danbloomberg/leptonica. See leptonica.org for more documentation.

Regression: file not found on MacOS when opening /tmp file #735

Open yonran opened 5 months ago

yonran commented 5 months ago

Starting with commit 05398d6c593893c4ee9706002218354558513e9a (1.84.0), Leptonica on macOS (darwin) gives an error when opening a file in /tmp. Also, the error message from fopenReadStream does not show the actual path that it tried to open. For example, here is a program (based on tesseract.cpp):

#include <allheaders.h>

int main(int argc, char* argv[]) {
    /* A file under /tmp, as created by ocrmypdf */
    const char* image = "/tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png";
    struct Pix *pixs = pixRead(image);
    if (!pixs) {
        fprintf(stderr, "Leptonica can't process input file: %s\n", image);
        return 2;
    }
    pixDestroy(&pixs);  /* release the image on success */
    return 0;
}

It gives this output:

Leptonica Error in fopenReadStream: file not found: 000011_ocr.png
Leptonica Error in pixRead: image file not found: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png
Leptonica can't process input file: /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png

This affects ocrmypdf (which uses tesseract, which in turn calls leptonica) when TMPDIR=/tmp:

nix-shell -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/4b8e9717fac859f830fa318a0cc1e2d4a40df152.tar.gz -p ocrmypdf --run 'ocrmypdf --redo-ocr --verbose=1 --keep-temporary-files ~/Downloads/20231017_TransferTaxExemptionMeasure.pdf ~/Downloads/20231017_TransferTaxExemptionMeasure-ocr.pdf'
…
    1 Running: ['/nix/store/pgz54swxlbxc2lxx23ramcfz099v7n6z-tesseract-5.3.3/bin/tesseract', '-l', 'eng', '-c', 'textonly_pdf=1',   __init__.py:134
'/tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png', '/tmp/ocrmypdf.io.xu77l3_5/000001_ocr_tess', 'pdf', 'txt']                                             
    1  Leptonica Error in fopenReadStream: file not found: 000001_ocr.png                                                          tesseract.py:252
    1  Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.xu77l3_5/000001_ocr.png                           tesseract.py:252
    1  Leptonica Error in fopenReadStream: file not found: PNG                                                                     tesseract.py:252
    1  Leptonica Error in pixRead: image file not found: PNG                                                                       tesseract.py:252

(Note: https://github.com/NixOS/nixpkgs/commit/4b8e9717fac859f830fa318a0cc1e2d4a40df152 is the first nixpkgs commit that contains both https://github.com/NixOS/nixpkgs/commit/628b90b5ad0a526dba2daeb17d07ce248f0c5275 and a fix for an unrelated error, “Abort trap: 6 mutool -v”: https://github.com/NixOS/nixpkgs/commit/11498aed21cfdc45e93d8243e6458d8883d45214 )

Workaround: Set TMPDIR=/private/tmp instead of /tmp before invoking ocrmypdf

DanBloomberg commented 5 months ago

@stweil

I remember a recent proposal to allow TMPDIR path rewrites for MacOS, but I believe it was shelved. This has been an issue for quite a while. We solved it for Windows by allowing path rewrites and universally using genPathname() and fopenReadStream(). These packaging issues are of course well above my pay grade.

Yonathan also points out that fopenReadStream() does not report the full path when it can't open the file locally. We can give more information at that failure point; e.g., replace line 1896 with:

        lept_stderr("Failed in %s to open locally with tail %s "
                    "for filename %s\n", __func__, tail, filename);

DanBloomberg commented 4 months ago

Oops, one should always use the error macros for error messages, not lept_stderr:

        L_ERROR("failed to open locally with tail %s for filename %s\n",
                __func__, tail, filename);
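
For anyone following along, the pattern being discussed (try the full path, then retry with only the tail in the current directory, and report both on failure) can be sketched in plain C without Leptonica. This is an illustration under assumptions, not the code in fopenReadStream(): path_tail and open_with_local_fallback are hypothetical names, and the message is written with fprintf rather than the L_ERROR macro.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the part of `filename` after the last '/',
 * or the whole string if it contains no '/'. */
const char *path_tail(const char *filename) {
    const char *slash = strrchr(filename, '/');
    return slash ? slash + 1 : filename;
}

/*
 * Hypothetical stand-in for the fallback discussed above:
 *   1. try to open the full path;
 *   2. if that fails, try the tail in the current directory;
 *   3. if that also fails, report both the tail and the full filename,
 *      so the user can see which path was actually requested.
 * Returns an open stream, or NULL on failure.
 */
FILE *open_with_local_fallback(const char *filename) {
    if (!filename) {
        return NULL;
    }
    FILE *fp = fopen(filename, "rb");
    if (fp) {
        return fp;
    }
    const char *tail = path_tail(filename);
    fp = fopen(tail, "rb");
    if (!fp) {
        fprintf(stderr,
                "Error in %s: failed to open locally with tail %s "
                "for filename %s\n", __func__, tail, filename);
    }
    return fp;
}
```

With the path from the report, the failure message would name both 000011_ocr.png and /tmp/ocrmypdf.io.uss3ldn7/000011_ocr.png instead of only the tail.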