UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

alto to text: too many spaces #129

Closed: jbarth-ubhd closed this issue 2 years ago

jbarth-ubhd commented 3 years ago

Example alto excerpt:

<TextLine><String CONTENT="Wappen:"/><SP/><String CONTENT="Heimstatt;"/><SP/><String CONTENT="Heimstatt,">... ...

converts to text

Wappen:␣␣Heimstatt;␣␣Heimstatt,␣␣Neipperg,␣␣Gemmingen ... ...

kba commented 3 years ago

You mean every <SP/> results in two spaces, not one?

jbarth-ubhd commented 3 years ago

I don't know exactly where the two spaces come from, but there should be only one, I'd say.
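
The expected behaviour (exactly one space per <SP/>) can be illustrated with a minimal Python sketch. This is not the XSLT that ocr-fileformat uses and does not explain where the extra space comes from; the function name `text_line_to_text` is hypothetical, and `sample` is a shortened, well-formed version of the excerpt above (the elided remainder of the line is left out, not guessed).

```python
import xml.etree.ElementTree as ET


def text_line_to_text(text_line_xml: str) -> str:
    """Join String/@CONTENT values, emitting exactly one space per <SP/>."""
    line = ET.fromstring(text_line_xml)
    parts = []
    for child in line:
        # Drop a namespace prefix of the form "{namespace-uri}", if present.
        tag = child.tag.rsplit("}", 1)[-1]
        if tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif tag == "SP":
            parts.append(" ")
        # Other children (e.g. hyphenation elements) are ignored in this sketch.
    return "".join(parts)


# Shortened, well-formed stand-in for the excerpt in this issue (assumption).
sample = (
    '<TextLine><String CONTENT="Wappen:"/><SP/>'
    '<String CONTENT="Heimstatt;"/><SP/>'
    '<String CONTENT="Heimstatt,"/></TextLine>'
)

print(text_line_to_text(sample))  # prints: Wappen: Heimstatt; Heimstatt,
```

Comparing the output of the real transformation against a reference like this makes the doubled spaces easy to spot in a regression test.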

kba commented 3 years ago

The ALTO-to-text transformation uses @filak's XSLT (https://github.com/filak/hOCR-to-ALTO/blob/master/alto__text.xsl), so this needs to be fixed upstream. Can you open an issue there as well, please?

jbarth-ubhd commented 3 years ago

opened: https://github.com/filak/hOCR-to-ALTO/issues/22

kba commented 3 years ago

fix: #130

stweil commented 2 years ago

So is this issue fixed and can it be closed?

kba commented 2 years ago

> So is this issue fixed and can it be closed?

Yes, it has been fixed. If you have an older installation of ocr-fileformat (from before February 2021), you'll need to re-clone hOCR-to-ALTO:

rm -rf vendor/hOCR-to-ALTO
make vendor install

(we should really use git submodules to make tracking changes and updating easier)