DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Strange replacement characters in the text #18

Closed dlazesz closed 3 years ago

dlazesz commented 3 years ago

There are strange replacement characters in the text which do not seem to be present in the original source.

An example from the wiki part (wiki_0002.tsv.gz:9602):

# newpar id = s3-u2-l1
# text = Sztáray Irma: Erzsébet �királyné kíséretében (1909)
Sztáray    " "    Sztáray    [/N][Nom]
Irma    ""    Irma    [/N][Nom]
:    " "    :    [Punct]
Erzsébet    " "    Erzsébet    [/N][Nom]
�    ""    �    [/Adj][Nom]
királyné    " "    királyné    [/N][Nom]
kíséretében    " "    kíséret    [/N][Poss.3Sg][Ine]
(    ""    (    [Punct]
1909)    "\n\n"    1909    [/Num|Digit][Nom][Punct]

Originally on: https://hu.wikipedia.org/wiki/Szt%C3%A1ray_Irma

DavidNemeskey commented 3 years ago

That error happens in the conversion script in zim_to_corpus. Need to investigate it a bit more to understand how.

DavidNemeskey commented 3 years ago

OK, I spoke too soon the last time; the error is not in the code, actually. The character in question is \u200B, the zero-width space, which belongs to the Format (Cf) category. Characters in this category do not count as whitespace, so the usual way of normalizing whitespace in a string s, ' '.join(s.split()), leaves them in the text.
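A minimal Python sketch of the behaviour described above. The sample string is made up for illustration (it only mimics the sentence from the report) and is not taken from the conversion script:

```python
import unicodedata

# Illustrative input: a zero-width space (U+200B) sits right before "királyné".
s = 'Erzsébet \u200bkirályné  kíséretében'

# The usual whitespace normalization: split on whitespace, rejoin with single spaces.
normalized = ' '.join(s.split())

# U+200B is in the Format (Cf) category and is not treated as whitespace by str.split().
category = unicodedata.category('\u200b')
is_space = '\u200b'.isspace()

# The double space is collapsed, but the zero-width space survives.
still_there = '\u200b' in normalized
```

Here the run of ordinary spaces is collapsed, while the zero-width space stays attached to the following word, which is why it later shows up as a token of its own.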

To get rid of them, we should filter them out of the input file. Care must be taken, because \p{C} also includes \n and \t, among others, which obviously need to be handled differently.
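One way such a filter could look, as a sketch only: the function name `strip_other_chars` and the choice of which characters to keep (`KEEP`) are assumptions for illustration, not the one-time script that was actually used. It relies on the standard `unicodedata` module instead of a `\p{C}` regex, since the built-in `re` module does not support Unicode property classes.

```python
import unicodedata

# Assumption: these control characters carry structure and must be preserved.
KEEP = {'\n', '\t', '\r'}


def strip_other_chars(text: str) -> str:
    """Drop every character whose Unicode category starts with 'C'
    (Cc, Cf, Cs, Co, Cn), except for the ones listed in KEEP."""
    return ''.join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith('C')
    )


# The zero-width space is removed, the newline and tab are left alone.
cleaned = strip_other_chars('Erzsébet \u200bkirályné\tkíséretében\n')
```

Note that this also removes other Cf characters such as the zero-width joiner and non-joiner, which may matter for scripts that depend on them; restricting the filter to a specific list of code points would be the more conservative choice.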

DavidNemeskey commented 3 years ago

Turns out this issue only concerns the Wikipedia subcorpus (see the linked issue above). Closing the issue here, and will update the corpus via a one-time script.