KBNLresearch / ochre

Toolbox for OCR post-correction
Apache License 2.0
122 stars 18 forks

All chars assumption #6

Open omrishsu opened 6 years ago

omrishsu commented 6 years ago

Hi, the train_lstm step writes an “all chars” text file, which assumes that it has seen every character in the corpus. This is not necessarily true: training uses limited data, so it may miss rare characters that show up in the correction step. Is that OK, or is it something that needs to be addressed?

Thanks! Omri

jvdzwaan commented 6 years ago

Actually, the chars are extracted from all text (train set, test set, and val set).

Whether this is correct (fair) is open for discussion. It is probably more correct to use only the characters in the train set (and maybe validation set) and have an 'unknown' character. It is likely that the 'unknown' character only appears in the input text, and not in the output text. Otherwise incorrect text will be produced.
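
To illustrate the alternative described above, here is a minimal sketch of building the character index from the training texts only and mapping everything else to an 'unknown' character. It is not ochre's actual code; the names `UNKNOWN`, `build_char_index` and `encode`, and the choice of placeholder symbol, are assumptions made for this example.

```python
# Hypothetical placeholder symbol; any character that cannot occur
# in the training data would do.
UNKNOWN = '\u25a1'


def build_char_index(train_texts, unknown=UNKNOWN):
    """Map each character seen in the training texts to an integer.

    The 'unknown' character gets the last index.
    """
    chars = sorted(set(''.join(train_texts)))
    if unknown in chars:
        raise ValueError('the unknown character occurs in the training data')
    char_to_int = {c: i for i, c in enumerate(chars)}
    char_to_int[unknown] = len(chars)
    return char_to_int


def encode(text, char_to_int, unknown=UNKNOWN):
    """Encode input text; characters not seen in training map to 'unknown'."""
    unk = char_to_int[unknown]
    return [char_to_int.get(c, unk) for c in text]


index = build_char_index(['abc', 'cab'])
encoded = encode('abz', index)  # 'z' was never seen during training
```

With this approach the 'unknown' index would only be fed to the model on the input side; the target texts would still have to be restricted to known characters.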

omrishsu commented 6 years ago

I've solved this issue by adding another param with chars to include.

BTW, do you want me to contribute these changes? I feel like it is very specific to my needs, but if you like...
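
The actual change is not shown in this thread. As a rough sketch of what "another param with chars to include" might look like, the function below merges a user-supplied string of extra characters into the set extracted from the texts. The function name `collect_chars` and the parameter `extra_chars` are hypothetical and not taken from ochre.

```python
def collect_chars(texts, extra_chars=''):
    """Return the sorted list of characters found in the texts.

    Characters in extra_chars (hypothetical parameter) are always included,
    so rare characters can be covered even if they are absent from the data.
    """
    chars = set(extra_chars)
    for text in texts:
        chars.update(text)
    return sorted(chars)


chars = collect_chars(['ab'], extra_chars='\u00e9a')
```

Characters added this way get an index in the vocabulary, but the model never sees them during training unless they also occur in the training texts.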