UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

[feature request] Support MacOS #150

Closed stweil closed 1 year ago

stweil commented 2 years ago

The current bash scripts contain code which does not work on MacOS out of the box (incompatible usage of sed, associative arrays, maybe more). Users are forced to install newer versions of bash and sed (which might be undesired) to run it.

Perhaps all bash scripts should be replaced by Python 3 scripts. python3 is already used in the code, and using it everywhere might even simplify the code. At least it would be portable. It would even be possible to publish ocr-fileformat on the Python Package Index (PyPI).

bertsky commented 1 year ago

@stweil could you please elaborate what kinds of effects this causes, or which scripts are affected?

@sarepal (on her Mac) just gets no output and no error from ocr-transform, but I'm not sure it's the same problem. (Also, --debug does not change anything.)

EDIT: (The actual PageConverter Java run in her case does work.)

bertsky commented 1 year ago

Oh, lib.sh without associative arrays would be extremely difficult, so basically everything is affected.

I guess the real issue then is what Mac users must do to get a recent bash?
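For context: associative arrays need bash 4.0 or newer, while the bash preinstalled on macOS is a 3.2 release. A minimal sketch of an up-front version check that could turn a silent failure into a readable error is shown below. The function name `supports_assoc_arrays` is hypothetical and not part of ocr-fileformat, and whether this explains the missing output reported above is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ocr-fileformat): succeed if the running
# bash is new enough for associative arrays (bash >= 4.0).
supports_assoc_arrays() {
    # An optional first argument overrides the detected major version,
    # which makes the check easy to exercise without a second bash.
    local major="${1:-${BASH_VERSION%%.*}}"
    # Treat an unknown version (not running under bash) as too old.
    [ "${major:-0}" -ge 4 ]
}

if ! supports_assoc_arrays; then
    echo "error: bash >= 4 is required (found ${BASH_VERSION:-unknown})" >&2
    echo "hint: on macOS, a newer bash can be installed, e.g. via Homebrew" >&2
    # A real script would stop here, e.g. with: exit 1
fi
```

Placed at the top of a script, before any `declare -A`, such a check would at least tell the user why nothing happens.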

stweil commented 1 year ago

Sure, installing a newer bash is an option, but not a good one because it increases the complexity for less experienced users. Ideally a pip install would work, and that can only use the preinstalled bash.

bertsky commented 1 year ago

Ok, but so far this is not a Python tool/repo at all. That's basically asking for a complete rewrite.

Shouldn't there be some better mechanism for isolation, like installing a newer bash only as user and adding it to the PATH temporarily?

sarepal commented 1 year ago

I have bash 5.2.15 installed through Homebrew. Upgrading it through pip did not resolve the issue.

sarepal commented 1 year ago

Also, when I run the sudo make install, I get this output:

    mkdir -p /usr/local/bin
    sed '/^SHAREDIR=/c SHAREDIR="/usr/local/share/ocr-fileformat"' bin/ocr-transform.sh | \
    sed "s/VERSION/v0.5.0-2-g85c6325/" > /usr/local/bin/ocr-transform
    sed: 1: "/^SHAREDIR=/c SHAREDIR= ...": command c expects \ followed by text
    sed '/^SHAREDIR=/c SHAREDIR="/usr/local/share/ocr-fileformat"' bin/ocr-validate.sh | \
    sed "s/VERSION/v0.5.0-2-g85c6325/" > /usr/local/bin/ocr-validate
    sed: 1: "/^SHAREDIR=/c SHAREDIR= ...": command c expects \ followed by text
    chmod a+x /usr/local/bin/ocr-transform /usr/local/bin/ocr-validate
    find /usr/local/share/ocr-fileformat -exec chmod u+w {} \;

Is this a sed version issue?

stweil commented 1 year ago

Yes, sed must also be replaced by a newer version (see my initial report above).

sarepal commented 1 year ago

Do you know what version of sed this tool requires? I have used https://medium.com/@bramblexu/install-gnu-sed-on-mac-os-and-set-it-as-default-7c17ef1b8f64 to brew install gnu-sed and export its path for use as sed.

bertsky commented 1 year ago

Current is v4.4, but I doubt the exact version matters (the c command has been around for a very long time).

Did you verify that the GNU version is replacing the BSD version? (What does --version say? And did you try to reinstall after that?)

sarepal commented 1 year ago

This is what it says for version:

    % sed --version
    sed (GNU sed) 4.9
    Copyright (C) 2022 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later https://gnu.org/licenses/gpl.html.
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.

    Written by Jay Fenlason, Tom Lord, Ken Pizzini,
    Paolo Bonzini, Jim Meyering, and Assaf Gordon.

    This sed program was built without SELinux support.

    GNU sed home page: https://www.gnu.org/software/sed/.
    General help using GNU software: https://www.gnu.org/gethelp/.
    E-mail bug reports to: bug-sed@gnu.org.

When I reinstall the program, I get the same output about sed as the earlier comment.

stweil commented 1 year ago

So you got the sed from homebrew's gnu-sed. The MacOS sed is too old to support sed --version.

stweil commented 1 year ago

The right solution for getting MacOS support is removing any dependency on sed and removing the associative arrays in shell code. As long as that is not implemented, getting bash and gnu-sed from Homebrew (which provides recent versions) should help.

Homebrew installs gsed, so either link or copy that to sed or add PATH="$(brew --prefix)/opt/gnu-sed/libexec/gnubin:$PATH" to get the new sed.
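To confirm which sed a given environment actually picks up, a small check like the following may help. It is only a sketch: the function name `is_gnu_sed` is hypothetical, and it relies on the observation in this thread that the macOS sed does not support `--version`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given sed command identifies itself
# as GNU sed. Defaults to whatever "sed" resolves to on the current PATH.
is_gnu_sed() {
    # A sed that rejects --version prints nothing on stdout, so grep fails.
    "${1:-sed}" --version 2>/dev/null | grep -q 'GNU sed'
}

if is_gnu_sed sed; then
    echo "sed on PATH is GNU sed: $(command -v sed)"
else
    echo "sed on PATH is not GNU sed: $(command -v sed)" >&2
fi
```

It is worth running this in the same environment as the install step (for example under sudo), since that environment may not see the same PATH as the interactive shell.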

bertsky commented 1 year ago

> So you got the sed from homebrew's gnu-sed. The MacOS sed is too old to support sed --version.

The output states it's GNU.

> The right solution for getting MacOS support is removing any dependency on sed and removing the associative arrays in shell code.

Why should we wreck all the shell scripts if we simply have a problem with that sed call?

> As long as that is not implemented, getting bash and gnu-sed from Homebrew (which provides recent versions) should help.

The version already is GNU, so there is something else going on.

I checked the documentation again, and this really looks like incorrect command syntax: the c command must be followed by a backslash and a newline before the text. I don't know why it has worked with a space until now, but that may be accidental. I'll make a PR.
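For reference, the one-line form `c text` is accepted by GNU sed as an extension, while POSIX specifies `c\` followed by a newline and the text, which matches the BSD error message quoted earlier. One portable way to sidestep the `c` command entirely is a plain substitution that rewrites the whole line. The sketch below illustrates that idea; the helper name `set_sharedir` is hypothetical, and this is not necessarily what the announced PR does.

```shell
#!/bin/sh
# Hypothetical helper: replace the whole SHAREDIR= line using "s" instead of
# "c", so the same one-liner works with both GNU sed and BSD sed.
set_sharedir() {
    # $1: new SHAREDIR value. Reads the script on stdin, writes to stdout.
    # Assumption: the value contains no "|", "&" or backslash characters.
    sed "s|^SHAREDIR=.*|SHAREDIR=\"$1\"|"
}

# Usage example on a tiny stand-in for bin/ocr-transform.sh:
printf '%s\n' '#!/bin/bash' 'SHAREDIR="."' 'echo "$SHAREDIR"' |
    set_sharedir /usr/local/share/ocr-fileformat
```

Because the pattern is anchored with `^SHAREDIR=` and consumes the rest of the line with `.*`, only the assignment line is replaced and every other line passes through unchanged.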