UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License
176 stars, 23 forks

[WIP] Fix page__alto and alto__page #105

zuphilip closed 4 years ago

zuphilip commented 4 years ago

This is work in progress, and I may need some help. The transformations alto2page and page2alto are now present, but not everything produces reasonable output yet.

Please do not merge this yet; we first need to finish this PR together. CC @kba @stweil

stweil commented 4 years ago

In the web GUI, each of these transformations produces an empty file.

The web interface uses STDIN and STDOUT to pass data to and from ocr-transform; for example, it runs ocr-transform page alto - -. This fails because the current scripts can write their output to STDOUT but cannot read their input from STDIN: they expect a real input file.
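A minimal sketch of that failure mode, not the project's actual code: `needs_real_file` is a hypothetical stand-in for a transformation script that only accepts a path to a regular file, and the one-line XML payload is made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a script that insists on a real input file.
needs_real_file() {
    local infile="$1"
    if [[ ! -f "$infile" ]]; then
        echo "error: '$infile' is not a regular file" >&2
        return 1
    fi
    cat "$infile"
}

# The caller passes "-" and pipes the data in, as the web interface is
# described to do. The piped data is never read, so nothing useful comes out.
if printf '<alto/>\n' | needs_real_file - 2>/dev/null; then
    result="ok"
else
    result="failed"
fi
echo "$result"
```

The stand-in rejects "-" because it is not a file name, which would be consistent with the empty output seen in the web GUI.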

zuphilip commented 4 years ago

Okay, I get some results by adding

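if [[ "$3" = "-" ]]; then
    # Input comes from STDIN: copy it to a temporary file,
    # because the rest of the script expects a real file.
    #echo "$0 $1 $2 $3"
    INFILE="$(mktemp)"
    cp /dev/stdin "$INFILE"
fi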

but I don't understand why the order of the variables changes ($3 vs. $1). Moreover, I am not sure that this is the right approach. Any advice?
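One possible variant of the snippet above, offered as a sketch rather than a recommendation: it buffers STDIN only when the argument is "-" and removes the temporary file on exit. The names `prepare_input`, `cleanup`, and `TMPFILE` are hypothetical, and the sample input is made up. Which positional parameter holds the input name in the real scripts is not settled here, so the helper takes it as an explicit argument.

```shell
#!/usr/bin/env bash

TMPFILE=""

# Remove the temporary file, if one was created, when the script exits.
cleanup() {
    if [[ -n "$TMPFILE" ]]; then
        rm -f "$TMPFILE"
    fi
}
trap cleanup EXIT

# Hypothetical helper: set INFILE to a real path for the given argument.
# For "-", STDIN is copied to a temporary file first.
prepare_input() {
    if [[ "$1" = "-" ]]; then
        TMPFILE="$(mktemp)"
        cat > "$TMPFILE"
        INFILE="$TMPFILE"
    else
        INFILE="$1"
    fi
}

# Demo: feed one line on STDIN and read it back through INFILE.
prepare_input - <<< '<PcGts/>'
content="$(cat "$INFILE")"
echo "$content"
```

Using `cat` with a redirect instead of `cp /dev/stdin` avoids depending on the `/dev/stdin` device node, and the `trap` keeps temporary files from piling up.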

stweil commented 4 years ago

Based on your work, I created a new PR, #106, which should work now. I have already installed it, so you can try it in the web interface.

zuphilip commented 4 years ago

Superseded by #106.