OSError: Invalid data stream when processing English Wiki dump

todpole3 commented 5 years ago

I encountered the following error while processing the latest English Wikipedia dump with the get-data-wiki.sh script.

I was able to extract an en.all file with 3.07M lines and 237M tokens. According to the paper, the entire Wikipedia extraction should contain ~2500M tokens.

I suspect this is a bug in the WikiExtractor tool or an issue with the latest data dump, but I am posting it here in case anyone has worked out a way to resolve it.

If you know a specific Wiki dump version that the tool can successfully process, please share.

Could you also share the expected size of a successfully extracted en.all (i.e., number of lines and number of tokens)?

./get-data-wiki.sh en
*** Cleaning and tokenizing en Wikipedia dump ... ***
Tokenizer Version 1.1
Language: en
Number of threads: 8
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:2450: DeprecationWarning: Flags not at the start of the expression '\\[(((?i)bitcoin:|ftp' (truncated)
  re.S | re.U)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:2457: DeprecationWarning: Flags not at the start of the expression '^(http://|https://)(' (truncated)
  re.X | re.S | re.U)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Book of Mormon' (3978): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Donald Rumsfeld' (8629): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Francis Ford Coppola' (10576): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Geography of India' (14597): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Jacob Neusner' (16452): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Joan Baez' (50960): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Édith Piaf' (64963): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Jessica Lange' (67763): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Ooty' (69973): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Modifier key' (77266): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Brigham Young University' (82058): title(3) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'London Borough of Enfield' (94318): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Coltan' (95913): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Terry Branstad' (99629): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Torrance, California' (107690): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Hollywood, Florida' (109038): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Barrett, Minnesota' (120108): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Boothwyn, Pennsylvania' (132404): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Harmonic series (mathematics)' (142488): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Blitter' (145474): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Speed of sound' (147853): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'John Mills' (153655): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Numerical integration' (170089): title(2) recursion(0, 0, 0)
WARNING: Template errors in article 'Strange Days (album)' (177614): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Harmonic number' (214729): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Florida State University' (239297): title(1) recursion(0, 0, 0)
WARNING: Template errors in article 'Inline speed skating' (261635): title(12) recursion(0, 0, 0)
WARNING: Template errors in article 'Electronic business' (264466): title(1) recursion(0, 0, 0)
/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py:663: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
  self.title, self.id, *errs)
WARNING: Template errors in article 'Ju' (301643): title(1) recursion(0, 0, 0)
Traceback (most recent call last):
  File "/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py", line 3296, in <module>
    main()
  File "/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py", line 3282, in main
    args.compress, args.processes)
  File "/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py", line 2967, in process_dump
    for page_data in pages_from(input):
  File "/export/home/xlin/Projects/XLM/tools/wikiextractor/WikiExtractor.py", line 2802, in pages_from
    for line in input:
  File "/opt/conda/lib/python3.7/fileinput.py", line 252, in __next__
    line = self._readline()
  File "/opt/conda/lib/python3.7/bz2.py", line 215, in readline
    return self._buffer.readline(size)
  File "/opt/conda/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/conda/lib/python3.7/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
OSError: Invalid data stream

todpole3 commented 5 years ago

I was able to get past this issue by catching the exception in /opt/conda/lib/python3.7/_compression.py.

todpole3 commented 5 years ago

No, my solution above does not work as expected. Since the exception is raised while the decompression library is reading the data stream, catching it simply discards whatever is left in the stream. I was still only able to get ~237M tokens in my en.all file.
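To illustrate why catching the exception only truncates the output, here is a minimal sketch that streams a .bz2 file with Python's standard bz2 module and reports how many bytes decompress before the stream fails. The function name probe_bz2 and the chunk size are my own choices for illustration, not part of WikiExtractor or the XLM scripts.

```python
import bz2


def probe_bz2(path, chunk_size=1 << 20):
    """Stream-decompress `path` and report how far decompression gets.

    Returns (ok, decompressed_bytes). `ok` is False when the stream is
    corrupt (OSError) or ends early (EOFError); in that case
    `decompressed_bytes` is the amount read before the failure.
    """
    total = 0
    try:
        with bz2.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    # Clean end of stream: the whole file decompressed.
                    return True, total
                total += len(chunk)
    except (OSError, EOFError):
        # The decompressor cannot resume after bad data, so everything
        # after this point in the file is lost.
        return False, total
```

If this returns False for the dump file itself, the problem is likely in the downloaded archive rather than in the extraction script.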

It would be great if you could report the size of your decompressed (and tokenized) en.all file so that I know what file size to expect. Thanks.
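For comparing numbers, this is roughly how the line and token counts can be measured. It is a small sketch that assumes tokens in the tokenized en.all file are separated by whitespace; the function name count_lines_and_tokens is hypothetical.

```python
def count_lines_and_tokens(path, encoding="utf-8"):
    """Count lines and whitespace-separated tokens in a text file."""
    n_lines = 0
    n_tokens = 0
    # errors="replace" keeps the count going if a few bytes are malformed.
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            n_lines += 1
            n_tokens += len(line.split())
    return n_lines, n_tokens
```

Calling count_lines_and_tokens("en.all") reads the file line by line, so it should work on a multi-gigabyte file without loading it into memory.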

alwayshjia commented 5 years ago

> I was able to get past this issue by catching the exception in /opt/conda/lib/python3.7/_compression.py.

I also ran into this problem. Could you give more details about how you caught the exception? (Currently I only get 175993 lines before hitting the error OSError: Invalid data stream.)

aconneau commented 5 years ago

Were you able to fix that issue? The English text file I got from wikiextractor was 12G and 41.9M lines.

alwayshjia commented 4 years ago

> Were you able to fix that issue? The English text file I got from wikiextractor was 12G and 41.9M lines.

Sorry for the late reply. Yes, I used the 20190601 version of the wiki dump and it worked. So I think the format of the latest wiki dump file may have some problems and it just doesn't work with the processing script.
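One way to rule out a corrupted or incomplete download before blaming the dump format is to compare the file's MD5 hash against the checksum that, as far as I know, Wikimedia publishes alongside each dump. A minimal sketch using only hashlib follows; the function names are hypothetical and the expected checksum has to be copied by hand from the dump's checksum listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch would point to a bad download, while a match would suggest the incompatibility really is between that dump version and the processing script.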

facebookresearch / XLM

OSError: Invalid data stream when processing English Wiki dump #187