QA Specs: How to deal with consecutive white spaces

mweidling commented 1 year ago

The specification currently makes no suggestion on how to deal with more than one consecutive white space character.

M3ssman commented 1 year ago

IIRC, in the GT (ground truth) discussions it was said that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.

Maybe @tboenig can shed more light on this.

bertsky commented 1 year ago

that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.

That would be consistent with our GT transcription guidelines. It also ensures that the text at line level is a mere concatenation of the text at word level, separated by single spaces.

I would add the important special case of whitespace at the start and end of the line: these should be stripped.
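
As a rough illustration of that invariant (line text equals the word texts joined by single spaces, with nothing at either end), here is a minimal Python sketch. The names `normalize_line` and `join_words` are hypothetical and not part of any OCR-D API, and treating every whitespace character like a plain space is an assumption.

```python
from typing import Sequence


def normalize_line(text: str) -> str:
    """Collapse each whitespace run to one space and strip both ends."""
    # str.split() without arguments splits on runs of whitespace and
    # discards leading/trailing whitespace, so re-joining with " "
    # yields single spaces only.
    return " ".join(text.split())


def join_words(words: Sequence[str]) -> str:
    """Build line-level text from word-level texts (hypothetical helper)."""
    # Strip each word and skip empty ones so no double spaces can appear.
    stripped = (word.strip() for word in words)
    return " ".join(word for word in stripped if word)


line_text = "  Ein   Beispiel  "
word_texts = ["Ein", "Beispiel"]

# After normalization, the line is exactly the concatenation of its words.
assert normalize_line(line_text) == join_words(word_texts)
```

With this kind of normalization applied on both sides, a comparison between line-level and word-level text no longer depends on how many spaces happened to be transcribed.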

The technical background for all this is that, in principle, LSTMs cannot reliably represent (or learn to represent) a sequence of whitespace characters, because there is nothing overt or visual that could be propagated. So forcing multiple whitespace characters during training can be expected to make the models less robust, not only around whitespace but also in general. And metrics in turn influence how models are made and evaluated.

kba commented 1 year ago

I also think that

consecutive whitespace should be normalized and
trailing/leading whitespace removed

i.e.

sed -E -e 's/^\s*//' -e 's/\s*$//' -e 's/\s{2,}/ /g'
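
For anyone who would rather do this in Python, the three sed substitutions could be mirrored roughly as below. This is only a sketch: `normalize_whitespace` is a made-up name, not an existing OCR-D function, and it assumes the input is a single line of text.

```python
import re


def normalize_whitespace(line: str) -> str:
    """Mirror the three sed substitutions, one step each."""
    line = re.sub(r"^\s+", "", line)     # remove leading whitespace
    line = re.sub(r"\s+$", "", line)     # remove trailing whitespace
    line = re.sub(r"\s{2,}", " ", line)  # collapse runs of 2+ to one space
    return line


print(normalize_whitespace("  foo   bar \t baz  "))  # prints "foo bar baz"
```

Note that, like the sed version, this leaves a single tab between two words untouched, because only runs of two or more whitespace characters are replaced. If every whitespace run should become a plain space, the last pattern would need to be `\s+` instead.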

OCR-D / spec

QA Specs: How to deal with consecutive white spaces #237