Open lalitkumarj opened 7 years ago
So I was able to use:
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted
Not sure why it needs to work on an extracted version.
"multistream" apparently means it was compressed in a different way (see here), so maybe WikiExtractor doesn't know how to handle that. It works on non-multistream bz2 dump files.
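For what it's worth, here is a small sketch of what "multistream" means in practice. It assumes Python 3.3 or later and uses tiny made-up payloads rather than a real dump: two independently compressed bz2 streams are concatenated, and a single decompressor object stops at the end of the first one.

```python
import bz2

# Two independently compressed streams, concatenated back to back.
# The payloads are made-up placeholders, not real dump content.
part_a = bz2.compress(b"<page>first</page>\n")
part_b = bz2.compress(b"<page>second</page>\n")
multistream = part_a + part_b

# A single BZ2Decompressor handles exactly one stream. Bytes that follow
# the end of that stream are left untouched in .unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multistream)
leftover = single.unused_data

# bz2.decompress() in Python 3.3+ walks through all concatenated streams.
everything = bz2.decompress(multistream)
```

A reader that only understands one stream would return `first_only` and silently ignore the rest, which matches the Python 2 limitation discussed in this thread.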
I was able to run it on Cygwin with this command:
bzcat enwiki-latest-pages-articles-multistream.xml.bz2| WikiExtractor.py -o output -s --lists --filter_category categories.txt -
So I was able to use:
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted
Not sure why it needs to work on an extracted version.
Shouldn't it be this, with a trailing "-"?
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | wikiextractor/WikiExtractor.py -o extracted -
It should be:
bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | python wikiextractor/WikiExtractor.py -o extracted -
WikiExtractor.py depends on fileinput. fileinput depends on bz2 when using fileinput.hook_compressed and reading a *.bz2 file.
But bz2.BZ2File in Python 2 "does not support input files containing multiple streams", as it says here: https://docs.python.org/2.7/library/bz2.html
Possible workarounds would be:
- Use python3 instead
- Use non-multistream data (like #61 above)
- Use the bzcat/stdin method (like the method above)
- Import bz2file as if it were bz2 before importing fileinput, for example:
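import sys
PY2 = sys.version_info[0] == 2
if PY2:
    # Python 2's bz2 cannot read multistream files, so substitute bz2file,
    # which provides a compatible BZ2File with multistream support.
    import bz2file as bz2
    sys.modules['bz2'] = bz2
else:
    import bz2
# Import fileinput only after the substitution so it picks up the replacement.
import fileinput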
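To illustrate what multistream-aware reading involves, here is a rough sketch that chains decompressors by hand. It is not taken from WikiExtractor or bz2file: the function name iter_decompressed and the chunk size are hypothetical, and it assumes Python 3.3 or later, where BZ2Decompressor exposes the eof attribute.

```python
import bz2
import io


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from a file object holding one or more bz2 streams."""
    decompressor = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            # Once a stream has ended, start a fresh decompressor for the next one.
            if decompressor.eof:
                decompressor = bz2.BZ2Decompressor()
            data = decompressor.decompress(chunk)
            if data:
                yield data
            # Bytes after the end of a stream belong to the next stream.
            chunk = decompressor.unused_data if decompressor.eof else b""


# Small in-memory demonstration with made-up payloads.
streams = [bz2.compress(b"alpha\n"), bz2.compress(b"beta\n")]
result = b"".join(iter_decompressed(io.BytesIO(b"".join(streams)), chunk_size=16))
```

On Python 3, bz2.open already handles multistream files, so this is only meant to show the idea, not to replace the workarounds listed above.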
Hi All
Does the wikiextractor work directly on the bz2 file? I used python setup.py to install the WikiExtractor.
Here is the command I then used:
wikiextractor/WikiExtractor.py -o extracted enwiki-20170301-pages-articles-multistream.xml.bz2
Here is the output:
What am I doing wrong? Thanks!!