dlazesz closed this issue 3 years ago
That error happens in the conversion script in zim_to_corpus. Need to investigate it a bit more to understand how.
OK, I spoke too soon the last time; the error is not in the code, actually. `\u200B` is the zero-width space character, which belongs to the Format (`Cf`) category. These characters are not whitespace, and therefore the usual way to get rid of whitespace from a string `s`, `' '.join(s.split())`, still keeps them in the text.
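A minimal sketch of the behaviour described above, using only the standard library. The sample string is made up for illustration and is not taken from the corpus:

```python
import unicodedata

# Hypothetical sample: a zero-width space between "foo" and "bar",
# and a run of ordinary spaces between "bar" and "baz".
s = "foo\u200bbar   baz"

# The usual whitespace normalization collapses the ordinary spaces only.
normalized = " ".join(s.split())

zwsp_category = unicodedata.category("\u200b")  # Unicode general category
zwsp_is_space = "\u200b".isspace()              # what str.split() relies on
zwsp_survives = "\u200b" in normalized

print(zwsp_category, zwsp_is_space, zwsp_survives)
```

Since `str.split()` with no arguments splits only on characters that Python treats as whitespace, the zero-width space passes through untouched.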
To get rid of them, we should filter these from the input file. Care must be taken, because `\p{C}` includes `\n` and `\t`, among others, which obviously should be handled differently.
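One possible way to do such filtering is sketched below. This is an assumption about how the cleanup could look, not the one-time script actually used; the function name `strip_other_chars` and the `KEEP` set are hypothetical. It drops every character whose Unicode general category starts with `C` (the `\p{C}` class), except for an explicit allow-list containing newline and tab:

```python
import unicodedata

# Hypothetical allow-list: characters in \p{C} that must survive,
# e.g. the line and column separators of a TSV file.
KEEP = {"\n", "\t"}


def strip_other_chars(text):
    """Remove all characters in the Unicode 'Other' (C*) categories,
    except those explicitly listed in KEEP."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("C")
    )


# Made-up sample: zero-width space, tab, newline and a NUL control character.
sample = "a\u200bb\tc\nd\x00e"
cleaned = strip_other_chars(sample)
print(repr(cleaned))
```

Note that the `Cf` category also contains characters such as the zero-width joiner and the soft hyphen, so whether removing the whole class is acceptable depends on the text; a narrower filter on `Cf` alone, or on `\u200B` only, may be the safer choice.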
Turns out this issue only concerns the Wikipedia subcorpus (see the linked issue above). Closing the issue here, and will update the corpus via a one-time script.
There are strange replacement characters in the text which do not seem to be present in the original source.
An example from the wiki part (wiki_0002.tsv.gz:9602):
Originally on: https://hu.wikipedia.org/wiki/Szt%C3%A1ray_Irma