Open todpole3 opened 5 years ago
I was able to work around this issue by catching the exception in /opt/conda/lib/python3.7/_compression.py.
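In case it helps others, here is roughly what that workaround amounts to as a standalone sketch, without editing the standard library. The dump file name and the use of bz2 as the reader are assumptions on my part; get-data-wiki.sh may consume the dump differently.

```python
import bz2

# Minimal sketch (assumption: the dump is a .bz2 file read as a text stream).
# Instead of patching /opt/conda/lib/python3.7/_compression.py, catch the
# OSError at the point where the stream is consumed and keep whatever was
# decoded so far.
lines_kept = 0
with open("en.all", "w", encoding="utf-8") as out:
    try:
        with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt",
                      encoding="utf-8", errors="replace") as dump:
            for line in dump:
                out.write(line)
                lines_kept += 1
    except OSError as err:  # e.g. "Invalid data stream"
        # The decompressor hit corrupt/truncated data; everything after this
        # point in the dump is dropped, which is why the resulting en.all
        # comes out far too small.
        print(f"Stopped after {lines_kept} lines: {err}")
```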
No, my solution above does not work as expected. Since the exception happens while the decompression library is constructing the data stream, catching it simply discards whatever is left in the stream, and I was still only able to get ~237M tokens in my en.all file.

It would be great if you could report the size of the decompressed (and tokenized) en.all file you have, so that I know the correct file size to expect. Thanks.
I also met this problem. Could you give more details about how you caught the exception in _compression.py? (Right now I only get 175,993 lines before hitting OSError: Invalid data stream.)
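(A small diagnostic sketch that may help tell a corrupt or truncated download apart from a script bug: decompress the dump end to end and see how far it gets. The file name below is an assumption; use whatever get-data-wiki.sh downloaded.)

```python
import bz2

# Rough diagnostic sketch: decompress the dump in chunks and report how much
# data comes out before the decompressor raises "Invalid data stream".
path = "enwiki-latest-pages-articles.xml.bz2"  # assumed file name
decompressed = 0
try:
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(1 << 20)  # 1 MiB of decompressed data at a time
            if not chunk:
                break
            decompressed += len(chunk)
except OSError as err:
    print(f"Decompression failed after {decompressed} bytes: {err}")
else:
    print(f"Decompressed {decompressed} bytes cleanly")
```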
Were you able to fix that issue? The English text file I got from wikiextractor was 12G and 41.9M lines.
Sorry for the late reply. Yes, I used the wiki dump of version 20190601 and it worked, so I think the format of the latest wiki dump may have some problems and it just didn't fit the processing script.
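For reference, the dated dumps follow the standard Wikimedia URL layout, so something like the sketch below can fetch the 20190601 English dump explicitly. The exact URL is an assumption on my part, and old snapshots are eventually removed from dumps.wikimedia.org, so a mirror or archive copy may be needed.

```python
import urllib.request

# Sketch only: fetch a specific dated dump instead of "latest". The URL
# follows the usual Wikimedia dumps layout, but old snapshots such as
# 20190601 may no longer be hosted on dumps.wikimedia.org.
version = "20190601"
url = (f"https://dumps.wikimedia.org/enwiki/{version}/"
       f"enwiki-{version}-pages-articles.xml.bz2")
urllib.request.urlretrieve(url, f"enwiki-{version}-pages-articles.xml.bz2")
```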
I encountered the following error while processing the latest English Wikipedia dump using the get-data-wiki.sh script. I was able to extract an en.all file with 3.07M lines and 237M tokens, whereas according to the paper the entire Wikipedia extraction should contain ~2500M tokens.

I suspect this is a bug in the WikiExtractor tool or an issue with the latest data dump, but I'm posting it here in case anyone has worked out a way to resolve it. If you know a specific Wiki dump version that the tool can successfully process, please share.

And would you please also share the expected size of a successfully extracted en.all (i.e. # lines / # tokens)?
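For comparison, here is a quick sketch for counting lines and whitespace-separated tokens in en.all; whether the paper's ~2500M figure is computed the same way (on the raw text rather than the tokenized/BPE output) is an assumption, so treat the comparison loosely.

```python
# Count lines and whitespace-separated tokens in the extracted file.
lines = 0
tokens = 0
with open("en.all", encoding="utf-8") as f:
    for line in f:
        lines += 1
        tokens += len(line.split())
print(f"{lines} lines, {tokens} tokens")
```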