DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Empty elements (sentence, paragraph, document) should be omitted from the corpus #19

Closed dlazesz closed 3 years ago

dlazesz commented 3 years ago

There are empty elements (containing zero tokens) in the corpus which should be omitted, as they add no actual data to the corpus.

Some examples:

wiki_0002.tsv.gz:747147
wiki_0014.tsv.gz:619965
wiki_0017.tsv.gz:783317
wiki_0026.tsv.gz:802087
wiki_0027.tsv.gz:638946
wiki_0072.tsv.gz:635924
wiki_0084.tsv.gz:668839
wiki_0104.tsv.gz:573378
wiki_0104.tsv.gz:573420
wiki_0105.tsv.gz:598034
wiki_0105.tsv.gz:598054
wiki_0120.tsv.gz:707838
wiki_0121.tsv.gz:651219

DavidNemeskey commented 3 years ago

@dlazesz I have regenerated the Wikipedia subcorpus. Could you check if the new corpus still contains elements like that?

dlazesz commented 3 years ago

Thank you, but there are still some erroneous paragraphs:

wiki_0007.tsv.gz:631377
wiki_0009.tsv.gz:684145
wiki_0010.tsv.gz:669053
wiki_0016.tsv.gz:701968
wiki_0020.tsv.gz:647101
wiki_0020.tsv.gz:647143
wiki_0026.tsv.gz:777295
wiki_0051.tsv.gz:237695
wiki_0059.tsv.gz:1003227
wiki_0065.tsv.gz:645317
wiki_0065.tsv.gz:645337
wiki_0076.tsv.gz:1162753
wiki_0085.tsv.gz:211427
wiki_0093.tsv.gz:656502
wiki_0123.tsv.gz:744510
wiki_0123.tsv.gz:992597
wiki_0137.tsv.gz:3304
wiki_0149.tsv.gz:648077
wiki_0154.tsv.gz:943065
wiki_0158.tsv.gz:491410

(This is the full list.)

I could post my code so that you can check it yourself; you may also include it in this repo.
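For reference, a check of this kind could look roughly like the minimal sketch below. It is not the script mentioned above, and the file layout is an assumption that this thread does not confirm: a line starting with `# newdoc` opens a document, a line starting with `# newpar` opens a paragraph, other `#` lines are comments, and every remaining non-blank line is a token. The names `find_empty_elements` and `report` are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def find_empty_elements(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, kind) for each document or paragraph with zero tokens.

    ASSUMED layout (not confirmed by the thread): '# newdoc' opens a document,
    '# newpar' opens a paragraph, other '#' lines are comments, and any other
    non-blank line is a token.
    """
    open_elems = {}  # kind -> [line number of the opening marker, token count]

    def close(kind: str) -> Iterator[Tuple[int, str]]:
        # Close the element of this kind, reporting it if it held no tokens.
        elem = open_elems.pop(kind, None)
        if elem is not None and elem[1] == 0:
            yield elem[0], kind

    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if line.startswith('# newdoc'):
            # A new document also ends the paragraph that was still open.
            yield from close('paragraph')
            yield from close('document')
            open_elems['document'] = [lineno, 0]
        elif line.startswith('# newpar'):
            yield from close('paragraph')
            open_elems['paragraph'] = [lineno, 0]
        elif line.strip() and not line.startswith('#'):
            # A token line counts towards every element that is currently open.
            for elem in open_elems.values():
                elem[1] += 1

    # End of input closes whatever is still open.
    yield from close('paragraph')
    yield from close('document')


def report(path: str) -> None:
    """Print one 'file:line<TAB>kind' row per empty element in a gzipped file."""
    with gzip.open(path, 'rt', encoding='utf-8') as inf:
        for lineno, kind in find_empty_elements(inf):
            print(f'{path}:{lineno}\t{kind}')
```

Calling `report('wiki_0007.tsv.gz')` would then print hits in a `file:line` style similar to the lists in this thread, assuming the numbers there are line numbers. Empty sentences are left out of the sketch on purpose, since how sentence boundaries are marked is not stated here.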

DavidNemeskey commented 3 years ago

Actually, this list is complete only if we consider just empty paragraphs; for instance, the page Lábjegyzet is completely empty. In any case, I am closing this issue, as it has nothing to do with the CC corpus and much more to do with this bug. Thanks for reporting, and for the list as well. :smile: