axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0

Ghostscript library not found on MacOS bare metal installation of Parsr 1.2.2 #583

Open slbayer opened 2 years ago

slbayer commented 2 years ago

Summary

On MacOS 11.6.4, using Parsr 1.2.2, the camelot package cannot find the Ghostscript library installed by MacPorts. I have good reason to believe that Homebrew installations have the same problem.

Steps To Reproduce

Expected behavior

You should find this in the log:

Table detection succeed

Actual behavior

You'll find this in the log:

    raise OSError(
OSError: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html

Environment

Additional context

The error is raised in the camelot package when it looks for the Ghostscript library, named gs. It calls ctypes.util.find_library("gs"), which searches in a number of locations, including those indicated by the environment variables DYLD_LIBRARY_PATH and DYLD_FALLBACK_LIBRARY_PATH. There are two problems.

First, on MacOS, the Parsr installation documents recommend using Homebrew to install the Ghostscript package, and my colleagues tell me that Homebrew does not set any of the relevant environment variables. Neither does MacPorts. The PATH is updated, but not the library path variables. So, in the default situation, camelot has no chance of finding the library.
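A minimal sketch of a check for this first problem, assuming only the standard library: it reports whether the two DYLD_ variables are set in the current process and what ctypes.util.find_library("gs") returns. The helper name ghostscript_lookup_report is hypothetical and is not part of camelot or Parsr.

```python
import ctypes.util
import os

# The two variables named above that influence the lookup on MacOS.
DYLD_VARS = ("DYLD_LIBRARY_PATH", "DYLD_FALLBACK_LIBRARY_PATH")


def ghostscript_lookup_report():
    """Hypothetical helper: summarize what this process can see.

    Maps each DYLD_ variable to its value (None when unset) and
    records the result of the same find_library call camelot makes.
    """
    report = {name: os.environ.get(name) for name in DYLD_VARS}
    # find_library returns a path or library name, or None when nothing is found.
    report["gs"] = ctypes.util.find_library("gs")
    return report


if __name__ == "__main__":
    for key, value in ghostscript_lookup_report().items():
        print(f"{key}: {value}")
```

If both variables come back as None and "gs" is also None, the process is in the default situation described above.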

Second, something in the overall npm invocation prevents these variables from propagating to child processes even when they are set. So doing this:

$ DYLD_FALLBACK_LIBRARY_PATH=/opt/local/lib npm run start:api

doesn't work either; by the time control reaches the detectTables() function in CommandExecuter.ts, the environment variable is already undefined. What might be going on here is that on MacOS, System Integrity Protection strips the DYLD_ environment variables when certain protected executables are launched, one of which is sh. So if a shell is wrapped around a command line invocation, the variable is lost. Here's an illustration in Python:

>>> import os, subprocess
>>> os.environ["DYLD_FALLBACK_LIBRARY_PATH"] = "/opt/local/lib"
>>> import ctypes.util
>>> ctypes.util.find_library("gs")
'/opt/local/lib/libgs.dylib'
>>> subprocess.run(["python", "-c", "import ctypes.util; print(ctypes.util.find_library('gs'))"])
/opt/local/lib/libgs.dylib
>>> subprocess.run("python -c \"import ctypes.util; print(ctypes.util.find_library('gs'))\"", shell=True)
None

However, the problem is not that the subcommand invoked within detectTables() is invoking a shell and stripping the variable; the problem is farther up the call chain, since the variable is already gone when detectTables() is invoked. I suspect that somewhere in the chain of commands that npm launches, a shell is invoked and strips the variable from the environment.

If I set the environment variable directly in detectTables() before the Python command is run, and rerun the npm command, the problem goes away.
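
The same idea, sketched in Python rather than in the TypeScript of CommandExecuter.ts: set the variable at the point where the child process is launched, and launch it without a shell. This is an illustration only, not the actual change made in detectTables(). The function names are hypothetical, and /opt/local/lib is the MacPorts location used earlier in this report; a Homebrew installation would need a different directory.

```python
import os
import subprocess
import sys

VAR = "DYLD_FALLBACK_LIBRARY_PATH"


def env_with_library_path(lib_dir="/opt/local/lib"):
    """Copy the current environment and put lib_dir first in VAR."""
    env = dict(os.environ)
    existing = env.get(VAR)
    # Keep any value that survived, but search lib_dir first.
    env[VAR] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env


def probe_child(lib_dir="/opt/local/lib"):
    """Run a child interpreter directly (no shell) and return what it sees in VAR."""
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({VAR!r}))"],
        env=env_with_library_path(lib_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(probe_child())
```

Because the variable is set immediately before the final process is spawned, it does not matter whether an earlier step in the chain dropped it. Whether the child actually receives it on MacOS still depends on the child executable not being one of the protected ones.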