Open shidaide2019 opened 3 years ago
At first I thought it might result from multiprocessing, so I changed the number of processes to 1, but the error was the same as with multiple processes.
I met the same problem. Have you solved it?
I think there is a problem on Windows with passing open file descriptors across processes. It would require some rewriting so that descriptors are opened within the worker processes.
I have the same problem on Windows too. How can it be solved?
Same error. Of course, it takes about 30 minutes just to reach the failure point in the code.
This issue isn't easily solvable, as wikiextractor relies on the multiprocessing module and the fork start method to create new processes, whereas Windows only supports spawn.
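For readers unfamiliar with the fork/spawn distinction: fork (Unix) clones the parent process, so children inherit open files and other unpicklable state for free, while spawn (the only method on Windows) starts a fresh interpreter and must pickle everything it sends to the child. A quick way to check and to reproduce the Windows behavior on Unix:

```python
import multiprocessing as mp

# Which start methods the current platform offers.
# Linux: ['fork', 'spawn', 'forkserver']   Windows: ['spawn']
print(mp.get_all_start_methods())

# Force spawn on Unix to debug Windows-only pickling failures locally.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # spawn
```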
Your best option is to use a WSL environment if you want to use the officially distributed package. If you have to stick to Windows, you can try my quick patch for Windows support: https://github.com/attardi/wikiextractor/pull/315
However, this patch basically moves all the logic from multiprocessing to multithreading, which has abysmal performance in comparison due to the GIL: almost linearly slower depending on your CPU count. That being said, at least it works. Extraction speed is about 150 articles/s.
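The shape of that trade-off can be sketched with the standard library (this is not the patch's actual code, just a stand-in `extract` function): threads share the parent's memory, so nothing needs pickling, but CPU-bound work serializes on the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(article):
    # Stand-in for the CPU-bound wikitext cleanup done per article.
    return article.upper()

articles = ["page one", "page two", "page three"]

# Threads see the same open file handles and in-memory state as the
# parent, so the Windows pickling problem disappears; the cost is that
# only one thread executes Python bytecode at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract, articles))

print(results)  # ['PAGE ONE', 'PAGE TWO', 'PAGE THREE']
```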
Sorry to disturb you, but I hit a weird bug while extracting a wiki bz2 dump. My Python version is 3.8, my Anaconda version is 2020.11, and I used pip to install wikiextractor (3.0.4). When I ran the command
python -m wikiextractor.WikiExtractor -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2
it produced the following error message after about 50 minutes of running:
Traceback (most recent call last):
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 621, in <module>
    main()
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 616, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 357, in process_dump
    reduce.start()
  File "C:\Users\win\Anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
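The two tracebacks are one failure seen from both sides: the parent fails to pickle the process state (it contains an open file handle), so the already-spawned child finds nothing to read on its pipe and hits EOF. A minimal sketch of both halves, with `io.BytesIO` standing in for the parent-to-child pipe:

```python
import io
import pickle

# Parent side: multiprocessing tries to pickle the worker's state,
# which here holds an open file handle, and fails.
handle = open(__file__)
try:
    pickle.dumps({"output": handle})
except TypeError as e:
    parent_error = str(e)   # cannot pickle '_io.TextIOWrapper' object
finally:
    handle.close()

# Child side: the spawned child is waiting to unpickle its state from
# the pipe; the parent wrote nothing, so it runs out of input.
try:
    pickle.load(io.BytesIO())
except EOFError as e:
    child_error = str(e)    # Ran out of input

print(parent_error)
print(child_error)
```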
I'm looking forward to your answer.