Closed dlazesz closed 3 years ago
@dlazesz I have regenerated the Wikipedia subcorpus. Could you check if the new corpus still contains elements like that?
Thank you, but there are still some erroneous paragraphs:
wiki_0007.tsv.gz:631377
wiki_0009.tsv.gz:684145
wiki_0010.tsv.gz:669053
wiki_0016.tsv.gz:701968
wiki_0020.tsv.gz:647101
wiki_0020.tsv.gz:647143
wiki_0026.tsv.gz:777295
wiki_0051.tsv.gz:237695
wiki_0059.tsv.gz:1003227
wiki_0065.tsv.gz:645317
wiki_0065.tsv.gz:645337
wiki_0076.tsv.gz:1162753
wiki_0085.tsv.gz:211427
wiki_0093.tsv.gz:656502
wiki_0123.tsv.gz:744510
wiki_0123.tsv.gz:992597
wiki_0137.tsv.gz:3304
wiki_0149.tsv.gz:648077
wiki_0154.tsv.gz:943065
wiki_0158.tsv.gz:491410
(This is the full list.)
I can post my code so that you can check it yourself, and you are welcome to include it in this repo.
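The checking code mentioned here is not part of the thread. As a stand-in, below is a minimal sketch of how such a check could look. It rests on assumptions the thread does not confirm: that the `.tsv.gz` files follow a CoNLL-U-like layout in which each paragraph starts with a comment line beginning with `# newpar`, that other comment lines start with `#`, and that every remaining non-blank line is a token. The function names `empty_paragraphs` and `report` are hypothetical, and the `file:line` output merely mirrors the shape of the list above.

```python
import gzip
from typing import Iterable, Iterator, List


def empty_paragraphs(lines: Iterable[str], marker: str = "# newpar") -> Iterator[int]:
    """Yield the 1-based line number of each paragraph marker that is followed
    by zero token lines before the next marker or the end of the input.

    Assumption: `marker` opens a paragraph; the real corpus may differ.
    """
    open_marker = None  # line number of the paragraph currently being read
    tokens = 0          # token lines seen since that marker
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(marker):
            # A new paragraph begins: report the previous one if it was empty.
            if open_marker is not None and tokens == 0:
                yield open_marker
            open_marker, tokens = lineno, 0
        elif line.strip() and not line.startswith("#"):
            # Assumed token line: non-blank and not a comment.
            tokens += 1
    # The last paragraph has no following marker, so check it separately.
    if open_marker is not None and tokens == 0:
        yield open_marker


def report(path: str) -> List[str]:
    """Return 'file:line' entries for every empty paragraph in a gzipped TSV."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [f"{path}:{lineno}" for lineno in empty_paragraphs(handle)]
```

Calling `report("wiki_0007.tsv.gz")` would then list the marker lines of paragraphs without tokens in that file, provided the layout assumptions hold. If the corpus marks paragraphs differently, only the `marker` argument and the token test need adjusting.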
Actually, this list is complete only if we restrict ourselves to empty paragraphs; for instance, the page Lábjegyzet is completely empty. In any case, I am closing this issue, as it has nothing to do with the CC corpus and is really about this bug. Thanks for reporting this, and for the list as well. :smile:
There are empty elements (containing zero tokens) in the corpus. They should be omitted, as they add no actual data.
Some examples: