attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

fails on the first file #292

Open vsraptor opened 2 years ago

vsraptor commented 2 years ago

    INFO: Preprocessed 22100000 pages
    INFO: Preprocessed 22200000 pages
    INFO: Loaded 738901 templates in 4795.6s
    INFO: Starting page extraction from enwiki-latest-pages-articles.xml.bz2.
    INFO: Using 7 extract processes.
    Process ForkProcess-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
        self.run()
      File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
        output.write(ordering_buffer.pop(next_ordinal))
      File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 173, in write
        self.file.write(data)
      File "/usr/lib/python3.8/bz2.py", line 245, in write
        compressed = self._compressor.compress(data)
    TypeError: a bytes-like object is required, not 'str'
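The exception comes from bz2 itself: a BZ2File opened in 'w' mode is a binary stream and rejects str. A minimal standalone sketch of the failure (not wikiextractor code):

    import bz2

    f = bz2.BZ2File('demo.bz2', 'w')   # binary mode: accepts bytes only
    try:
        f.write('some extracted text')             # same TypeError as above
    except TypeError as e:
        print(e)    # a bytes-like object is required, not 'str'
    f.write('some extracted text'.encode('utf8'))  # encoding first works
    f.close()

The write method in OutputSplitter hits exactly this: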


    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data)
        else:
            self.file.write(data)

should be:

    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data.encode('utf8'))
        else:
            self.file.write(data)
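Only the compressed branch needs the encode: BZ2File in 'w' mode is a binary stream, while the uncompressed branch writes to a file opened in text mode (see the open method below), which accepts str directly.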
rxm commented 1 year ago

I had the same issue and @vsraptor's modification fixed it for me. Thank you for posting.

While you are in that file, you could also replace the Bzip2 output compression with Gzip (adding import gzip under the import bz2 line). I work with compressed files downstream, and gzip files are significantly faster to deal with, at a small price in size.

    def open(self, filename):
        if self.compress:
            # return bz2.BZ2File(filename + '.bz2', 'w')
            return gzip.GzipFile(filename + '.gz', mode='w')
        else:
            return open(filename, 'w')
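A variant worth noting: opening the gzip stream in text mode sidesteps the bytes/str mismatch entirely, so the encode in write() becomes unnecessary (and must be dropped, since a text stream rejects bytes). A sketch, assuming the rest of OutputSplitter is unchanged:

    # assumes "import gzip" at the top of the module
    def open(self, filename):
        if self.compress:
            # gzip.open in 'wt' mode wraps the compressed stream in a
            # text layer, so write() can keep passing str unencoded
            return gzip.open(filename + '.gz', 'wt', encoding='utf-8')
        else:
            return open(filename, 'w')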
rxm commented 1 year ago

I noticed that the last compressed file created (as given by NextFile) when using the --compressed flag is incomplete. I have tried flushes, closes, and scattered sleeps, but I have not yet found where the problem is (this is with bz2 compression). Any ideas?

I instrumented the OutputSplitter class and found that OutputSplitter.close() is never called for the last file. There are also a few extra writes to the last file. Wikiextractor is a multiprocess script with several processes reading the dump and one reduce_process writing the results. When reduce_process runs out of things to write, it terminates and leaves it to the calling process to close the OutputSplitter object, but by that point the parent and child hold different copies of the object, so the file the child left open is never closed. Adding an output.close() at the bottom of reduce_process closes the currently open file.
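A minimal sketch of that fix, assuming a simplified reduce_process; the real function in WikiExtractor.py takes more arguments and keeps an ordering buffer, so the names here are illustrative:

    def reduce_process(output_queue, output):
        # drain extracted pages from the queue and write them out
        while True:
            item = output_queue.get()
            if item is None:    # sentinel: extract processes are finished
                break
            output.write(item)
        # without this, the last file opened by OutputSplitter is never
        # closed in this process and the final .bz2/.gz ends up truncated
        output.close()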

It also works when using gzip.GzipFile.