Closed (jbarth-ubhd closed 2 years ago)
You mean every <SP/> results in two spaces, not one?
I don't know exactly where the two spaces come from, but there should only be one, I'd say.
The ALTO-to-text transformation uses @filak's XSLT (https://github.com/filak/hOCR-to-ALTO/blob/master/alto__text.xsl), so this needs to be fixed upstream. Can you open an issue there as well, please?
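For illustration only, here is a minimal Python sketch of the behaviour I'd expect: each <SP/> between two String elements becomes exactly one space. This is not @filak's XSLT and not how ocr-fileformat does the conversion; the ALTO fragment is made up, and the helper names alto_to_text and local_name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment, invented for this sketch
# (it is NOT the excerpt from the original report).
ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Hello"/>
      <SP/>
      <String CONTENT="world"/>
    </TextLine>
    <TextLine>
      <String CONTENT="second"/>
      <SP/>
      <String CONTENT="line"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return one output line per TextLine, with exactly one space per <SP/>."""
    root = ET.fromstring(xml_string)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        parts = []
        # Only element children are visited, so indentation whitespace
        # in the XML source never leaks into the output.
        for child in elem:
            name = local_name(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":
                parts.append(" ")  # a single space, never two
        lines.append("".join(parts))
    return "\n".join(lines)


print(alto_to_text(ALTO))
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version in the namespace URI. A doubled space in real output would suggest that the converter emits a space for <SP/> and additionally passes through whitespace from somewhere else, but that is a guess, not a diagnosis of the XSLT.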
fix: #130
So is this issue fixed and can it be closed?
Yes, it has been fixed. If you have an older installation of ocr-fileformat (from before February 2021), you'll need to re-clone hOCR-to-ALTO:
rm -rf vendor/hOCR-to-ALTO
make vendor install
(we should really use git submodules to make tracking changes and updating easier)
Example alto excerpt:
converts to text