attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

EOFError: Ran out of input #242

Open shidaide2019 opened 3 years ago

shidaide2019 commented 3 years ago

Sorry to disturb you, but I hit a weird bug while extracting a Wikipedia bz2 dump. My Python version is 3.8 and my Anaconda version is 2020.11. I installed wikiextractor (3.0.4) with pip, and when I ran the command

python -m wikiextractor.WikiExtractor -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2

it produced the following error message after about 50 minutes of running:

Traceback (most recent call last):
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 621, in <module>
    main()
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 616, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 357, in process_dump
    reduce.start()
  File "C:\Users\win\Anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I'm looking forward to your answer.

shidaide2019 commented 3 years ago

At first I thought it might result from multiprocessing, so I changed the number of processes to 1, but the error is the same as with multiple processes.
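
(For reference, recent wikiextractor versions expose the worker count as a --processes option, so the single-process run would look like the command below, assuming the 3.0.4 command line matches current versions. The traceback, however, fails inside reduce.start(), i.e. the separate output/reduce process that is launched regardless of the worker count, which is why lowering the count does not avoid the pickling error.)

python -m wikiextractor.WikiExtractor --processes 1 -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2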

runpingzhong commented 3 years ago

I met the same problem. Have you solved it?

attardi commented 3 years ago

I think there is a problem on Windows with passing file descriptors across processes. It would require some rewriting so that the descriptors are opened within the worker processes.
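
A minimal sketch of the failure mode and of the rewrite suggested above (it assumes nothing about wikiextractor's internals; worker_bad and worker_ok are hypothetical names): on Windows, multiprocessing starts children with the spawn method, which pickles the Process arguments, and an open _io.TextIOWrapper cannot be pickled. Passing the path instead and opening the file inside the child avoids this.

import multiprocessing

def worker_bad(out_file):
    # Receives an already-open file object; with the spawn start method the
    # Process arguments are pickled, and _io.TextIOWrapper is not picklable.
    out_file.write("some text\n")

def worker_ok(out_path):
    # Receives only a path string (picklable) and opens the descriptor
    # inside the child process, which is the rewrite suggested above.
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("some text\n")

if __name__ == "__main__":
    # Fails on Windows with "TypeError: cannot pickle '_io.TextIOWrapper' object",
    # and the half-initialized child then dies with "EOFError: Ran out of input":
    # f = open("out.txt", "w")
    # multiprocessing.Process(target=worker_bad, args=(f,)).start()

    # Works, because only the path crosses the process boundary:
    p = multiprocessing.Process(target=worker_ok, args=("out.txt",))
    p.start()
    p.join()

On Linux the first version happens to work, because fork simply inherits the parent's open descriptors; that is why the problem only shows up where spawn is the default start method, as on Windows.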

ArlanCooper commented 3 years ago

I have the same problem on Windows, too. How can it be solved?

number435398 commented 1 year ago

Same error. And of course it takes about 30 minutes just to reach the failure point in the code.

rgryta commented 1 year ago

This issue isn't easily solvable, as wikiextractor relies on the multiprocessing module with the fork start method to create new processes, rather than the spawn method that Windows provides.

Your best option is to use a WSL environment if you want to use the officially distributed package. If you have to stick to Windows, you can try my quick patch for Windows support: https://github.com/attardi/wikiextractor/pull/315

However, this patch basically moves all the logic from multiprocessing to multithreading, which performs far worse than multiprocessing because of the GIL: roughly linearly slower in proportion to your CPU count. That said, at least it works. Extraction speed is about 150 articles/s.
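
Roughly, the idea behind such a patch can be sketched as follows (a hypothetical illustration of a thread-based pipeline, not the actual code from the pull request): threads share the parent's memory, so queues and open file objects never need to be pickled, but the GIL keeps the CPU-bound markup stripping effectively single-threaded.

import queue
import sys
import threading

def extract_worker(jobs, results):
    # Hypothetical stand-in for a page-extraction worker. Because this is a
    # thread, it can freely share the queues and the output file with the parent.
    while True:
        page = jobs.get()
        if page is None:               # sentinel: no more work
            results.put(None)
            break
        results.put(page.upper())      # placeholder for real wiki-markup stripping

def reduce_worker(results, out_file, n_workers):
    # Collects output from all workers and writes it, playing the role of the
    # 'reduce' process that fails to start in the traceback above.
    done = 0
    while done < n_workers:
        item = results.get()
        if item is None:
            done += 1
        else:
            out_file.write(item + "\n")

if __name__ == "__main__":
    jobs, results = queue.Queue(), queue.Queue()
    n_workers = 4
    workers = [threading.Thread(target=extract_worker, args=(jobs, results))
               for _ in range(n_workers)]
    reducer = threading.Thread(target=reduce_worker,
                               args=(results, sys.stdout, n_workers))
    for t in workers:
        t.start()
    reducer.start()
    for page in ("first page", "second page"):
        jobs.put(page)
    for _ in workers:
        jobs.put(None)                 # one sentinel per worker
    for t in workers:
        t.join()
    reducer.join()

The same structure built on multiprocessing.Process is what trips over the pickling problem on Windows, since an open output file handed to a spawned child has to cross the process boundary.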