Open vsraptor opened 2 years ago
I had the same issue and @vsraptor's modification fixed it for me. Thank you for posting.
While you are in that file, you could also switch the output compression from Bzip2 to Gzip (adding an import gzip line under the import bz2 line). I work with the compressed files downstream, and Gzip files are significantly faster to deal with, at a small price in size.
def open(self, filename):
    if self.compress:
        # return bz2.BZ2File(filename + '.bz2', 'w')
        return gzip.GzipFile(filename + '.gz', mode='w')
    else:
        return open(filename, 'w')
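For anyone who wants to check the speed versus size trade-off on their own data, here is a minimal sketch. The measure helper and the sample payload are made up for illustration, and highly repetitive text like this is not representative of a real dump, so treat the numbers as a rough indication only.

```python
import bz2
import gzip
import time


def measure(compress, decompress, data):
    """Time one compress/decompress round trip and report the packed size."""
    start = time.perf_counter()
    packed = compress(data)
    middle = time.perf_counter()
    restored = decompress(packed)
    end = time.perf_counter()
    if restored != data:
        raise ValueError("round trip changed the data")
    return {
        "size": len(packed),
        "compress_s": middle - start,
        "decompress_s": end - middle,
    }


# Hypothetical payload; substitute a slice of real extracted text.
sample = b"Wikipedia article text, repeated to get a measurable payload.\n" * 20000

results = {
    "gzip": measure(gzip.compress, gzip.decompress, sample),
    "bz2": measure(bz2.compress, bz2.decompress, sample),
}

for name, r in results.items():
    print(name, r["size"], "bytes,",
          "compress %.4fs," % r["compress_s"],
          "decompress %.4fs" % r["decompress_s"])
```

Both functions use the default compression levels here; changing the level shifts the balance in either direction.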
I noticed that the last compressed file created (as given by NextFile) when using the --compressed flag is incomplete. I have tried flushes, closes, and scattered sleeps, but I have not yet found where the problem is (this is using bz2 compression). Any ideas?
I instrumented the OutputSplitter class and found that OutputSplitter.close() is not called for the last file. There are also a few extra writes to the last file. Wikiextractor is a multiprocess script that has several processes reading the dump and one reduce_process writing the results. When reduce_process runs out of things to write it terminates and leaves it to the calling process to close the OutputSplitter object, but by then the two processes hold different copies of that object, so the close never reaches the file that was actually written. Adding an output.close() to the bottom of reduce_process closes the currently open file.
It also works when using gzip.GzipFile.
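The idea of closing the output inside the writer can be sketched as follows. This is a simplified illustration, not the actual Wikiextractor code: the OutputSplitter here is a stripped-down stand-in, the None sentinel and the demo function are assumptions, and the writer is run in-process with queue.Queue to keep the example short (multiprocessing.Queue offers the same get() call).

```python
import bz2
import os
import queue
import tempfile


class OutputSplitter:
    """Hypothetical, simplified stand-in for Wikiextractor's OutputSplitter."""

    def __init__(self, path):
        self.file = bz2.BZ2File(path, "w")

    def write(self, data):
        self.file.write(data)

    def close(self):
        self.file.close()


def reduce_process(jobs, path):
    """Write queued chunks to `path`; None is the assumed end-of-input sentinel."""
    output = OutputSplitter(path)
    try:
        while True:
            chunk = jobs.get()
            if chunk is None:
                break
            output.write(chunk)
    finally:
        # Close in the same place that did the writing, so the bz2 stream
        # is finalized even if the caller's copy of the object is never closed.
        output.close()


def demo():
    """Write 1000 small chunks and count the lines read back."""
    jobs = queue.Queue()
    for i in range(1000):
        jobs.put(("page %d\n" % i).encode("utf-8"))
    jobs.put(None)
    path = os.path.join(tempfile.mkdtemp(), "wiki_00.bz2")
    reduce_process(jobs, path)
    with bz2.open(path, "rb") as f:
        return f.read().count(b"\n")
```

Using try/finally rather than a bare close() at the bottom means the last file is also finalized if a write raises.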
INFO: Preprocessed 22100000 pages
INFO: Preprocessed 22200000 pages
INFO: Loaded 738901 templates in 4795.6s
INFO: Starting page extraction from enwiki-latest-pages-articles.xml.bz2.
INFO: Using 7 extract processes.
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
    output.write(ordering_buffer.pop(next_ordinal))
  File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 173, in write
    self.file.write(data)
  File "/usr/lib/python3.8/bz2.py", line 245, in write
    compressed = self._compressor.compress(data)
TypeError: a bytes-like object is required, not 'str'
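The TypeError in the traceback comes from writing a str to a bz2.BZ2File opened in binary mode under Python 3. As a general illustration, and not necessarily the change proposed in this thread, the sketch below reproduces the mismatch and shows two common ways around it: encoding before the write, or opening the file in text mode with bz2.open. The path and sample text are made up.

```python
import bz2
import os
import tempfile

# Hypothetical output path and sample text for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
text = "Some extracted page text\n"

# Binary mode, as with BZ2File(filename, 'w'): str is rejected on Python 3.
with bz2.BZ2File(path, "w") as f:
    try:
        f.write(text)
        binary_accepts_str = True
    except TypeError:
        binary_accepts_str = False
    # Option 1: encode the str to bytes before writing.
    f.write(text.encode("utf-8"))

# Option 2: open in text mode so that str can be written directly.
# This reopens the same path for writing and so replaces the file above.
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(text)

# Read the file back in text mode to confirm the round trip.
with bz2.open(path, "rt", encoding="utf-8") as f:
    round_trip = f.read()
```

The same two options apply to gzip.GzipFile and gzip.open.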
should be :