UB-Mannheim / ocr-gt-tools

Ergonomic line-by-line transcription of scanned text.
GNU Affero General Public License v3.0

Transcription of spaces #66

Closed: kba closed this issue 8 years ago

kba commented 8 years ago

E.g. header line in http://digi.bib.uni-mannheim.de/fileadmin/digi/445441798/max/445441798_0016.jpg

14␣Tractatus

vs

14␣␣␣␣␣␣␣␣␣␣␣Tractatus

zuphilip commented 8 years ago

Our rules say that it doesn't matter whether there is one space or many. Maybe we should suggest another kind of comparison in ocropus-errs that collapses runs of multiple spaces (but leaves a single space intact), i.e.

    if kind=="nomultiplespace":
        # collapse each run of whitespace into one literal space (the replacement must be ' ', not '\s')
        return re.sub(ur'\s+', u' ', s)

in https://github.com/tmbdev/ocropy/blob/master/ocrolib/common.py#L126
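To illustrate the idea, here is a minimal standalone sketch of such a comparison. It is written for Python 3 (the snippet above uses Python 2 syntax), and the helper name `collapse_spaces` is hypothetical, not part of ocrolib. The two strings stand in for the two transcriptions of the header line, with ␣ written as a real space; the number of spaces in the wide variant is only illustrative.

```python
import re


def collapse_spaces(s):
    """Replace every run of whitespace with a single space (hypothetical helper)."""
    return re.sub(r'\s+', ' ', s)


# Two transcriptions of the same header line: one space vs. many spaces.
narrow = "14 Tractatus"
wide = "14" + " " * 11 + "Tractatus"

# A plain string comparison treats them as different ...
print(narrow == wide)

# ... while comparing the collapsed forms treats them as equal.
print(collapse_spaces(narrow) == collapse_spaces(wide))
```

With a normalization like this, both transcriptions would count as identical when computing errors, which matches the rule that the number of spaces does not matter.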

kba commented 8 years ago

@zuphilip Good idea. Maybe create an issue in ocropy to keep track of it?

zuphilip commented 8 years ago

I created an issue in the ocropy repo: https://github.com/tmbdev/ocropy/issues/98