mweidling opened this issue 1 year ago
IIRC, in the GT discussions it was said that these characters are normalized, and that punctuation is attached to the preceding character without an extra space.
Maybe @tboenig can shed more light on this.
> that these chars are normalized, as well as punctuation is tied to the preceding char without extra space.
That would be consistent with our GT transcription guidelines. It also ensures that the text at line level is a mere concatenation of the text at word level, interspersed with single spaces.
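As a quick sketch of that invariant (the data and the helper `line_from_words` are hypothetical, not taken from any actual tool or PAGE file):

```python
def line_from_words(word_texts):
    """Reconstruct the line-level text as a single-space join of
    the word-level texts, per the transcription guideline above."""
    return " ".join(word_texts)

# Hypothetical example: word-level texts and the expected line-level text.
word_texts = ["Hello", "world,", "said", "the", "page."]
line_text = "Hello world, said the page."

assert line_from_words(word_texts) == line_text
```

If this invariant holds, checking consistency between the two levels is a one-line comparison.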
I would add the important special case of whitespace at the start and end of a line: it should be stripped.
The technical background for all this is that, in principle, LSTMs cannot reliably (learn to) represent a sequence of white spaces, because there is nothing overt/visual that could be propagated. So forcing multiple white spaces during training can be expected to make the models less robust – not only around whitespace, but also in general. And metrics in turn influence how models are built and evaluated.
I also think that whitespace should be normalized like that, i.e.

```shell
sed -e 's,^\s*,,' -e 's,\s*$,,' -e 's,\s\{2,\}, ,g'
```

(Note: `\s` and the `\{2,\}` interval syntax are GNU sed extensions; the original `\s{2,}` would be taken literally in basic regular expressions.)
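For tools that work in Python rather than shelling out, the same normalization could be sketched as follows (the function name `normalize_whitespace` is my own, not from any existing tool; note it also maps single tabs to spaces, which the sed one-liner does not):

```python
import re

def normalize_whitespace(text):
    """Strip leading/trailing whitespace and collapse any run of
    whitespace to a single space, mirroring the sed one-liner above."""
    return re.sub(r"\s+", " ", text).strip()

assert normalize_whitespace("  foo \t  bar ") == "foo bar"
```

Applying this to both GT and OCR text before computing the metric would make the evaluation independent of incidental whitespace differences.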
The specification currently makes no suggestion on how to deal with more than one consecutive white space character.