Wikiextractor not extracting

attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

GNU Affero General Public License v3.0


Wikiextractor not extracting #124

Open lalitkumarj opened 7 years ago

lalitkumarj commented 7 years ago

Hi All

Does WikiExtractor work directly on the bz2 file? I used python setup.py to install WikiExtractor.

Here is the command I then used: wikiextractor/WikiExtractor.py -o extracted enwiki-20170301-pages-articles-multistream.xml.bz2

Here is the output:

INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from enwiki-20170301-pages-articles-multistream.xml.bz2.
INFO: Using 7 extract processes.
INFO: Finished 7-process extraction of 0 articles in 0.0s (0.0 art/s)

What am I doing wrong? Thanks!!

lalitkumarj commented 7 years ago

So I was able to use: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted

Not sure why it needs to work on an extracted version.

BrenBarn commented 7 years ago

"Multistream" apparently means the dump was compressed in a different way, so maybe WikiExtractor doesn't know how to handle that. It works on non-multistream bz2 dump files.
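To illustrate what "compressed in a different way" can mean in practice, here is a minimal sketch, assuming a multistream file is simply several independent bz2 streams written back to back. The sample data and the helper names (make_multistream, first_stream_only) are made up for this example and are not part of WikiExtractor. A single decompressor stops at the end of the first stream, which is one way a reader could end up seeing only a fraction of the input.

```python
import bz2


def make_multistream(chunks):
    """Compress each chunk as its own bz2 stream and concatenate the results."""
    return b"".join(bz2.compress(chunk) for chunk in chunks)


def first_stream_only(data):
    """Decode with one BZ2Decompressor, which stops at the end of the first stream."""
    decomp = bz2.BZ2Decompressor()
    text = decomp.decompress(data)
    # Bytes after the end of the first stream are left untouched in unused_data.
    return text, decomp.unused_data


# Hypothetical sample: two tiny "pages", each in its own stream.
data = make_multistream([b"<page>one</page>\n", b"<page>two</page>\n"])

head, leftover = first_stream_only(data)
print(head)           # content of the first stream only
print(len(leftover))  # compressed bytes of the second stream, not yet decoded

# On Python 3.3+, bz2.decompress() walks through every stream in the input.
all_text = bz2.decompress(data)
print(all_text)
```

A reader that handles only the first stream would stop after `head`; one that supports multiple streams returns the content of both.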

markdimi commented 7 years ago

https://github.com/attardi/wikiextractor/issues/61

astha-chem commented 6 years ago

I was able to run it on Cygwin with the command: bzcat enwiki-latest-pages-articles-multistream.xml.bz2 | WikiExtractor.py -o output -s --lists --filter_category categories.txt -

zhixiaochuan12 commented 5 years ago

So I was able to use: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted

Not sure why it needs to work on an extracted version.

Shouldn't it be bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | wikiextractor/WikiExtractor.py -o extracted - (with the trailing -)?

bruce803 commented 4 years ago

So I was able to use: bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted Not sure why it needs to work on an extracted version.

Should not it be bzcat enwiki-20170301-pages-articles-multistream.xml.bz2| wikiextractor/WikiExtractor.py -o extracted -?

It should be "bzcat enwiki-20170301-pages-articles-multistream.xml.bz2 | python wikiextractor/WikiExtractor.py -o extracted -"

elbakramer commented 4 years ago

WikiExtractor.py depends on fileinput. fileinput in turn depends on bz2 when fileinput.hook_compressed is used to read a *.bz2 file. But bz2.BZ2File in Python 2 "does not support input files containing multiple streams", as the documentation says: https://docs.python.org/2.7/library/bz2.html

Possible workarounds would be:

Use Python 3 instead
Don't use multistream data (like #61 above)
Use decompressed data (like the bzcat/stdin method above)

Import bz2file under the name bz2 before importing fileinput, for example:

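import sys

PY2 = sys.version_info[0] == 2
if PY2:
    # bz2file is a third-party package whose BZ2File can read multi-stream files.
    import bz2file as bz2
    # Register it under the name 'bz2' so later "import bz2" statements get it.
    sys.modules['bz2'] = bz2
else:
    # On Python 3 the standard bz2 module is used as-is.
    import bz2
# Import fileinput only after the substitution above.
import fileinput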
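A quick way to check whether a given interpreter is affected is to build a small multi-stream file and read it back through fileinput.hook_compressed, the path described above. This is only a sketch: the helper names (write_multistream_sample, count_lines) and the sample file name are invented for the test and do not come from WikiExtractor.

```python
import bz2
import fileinput
import os
import tempfile


def write_multistream_sample(path, n_streams=3, lines_per_stream=2):
    """Write n_streams separate bz2 streams, back to back, into one file."""
    with open(path, "wb") as out:
        for s in range(n_streams):
            block = "".join(
                "stream %d line %d\n" % (s, i) for i in range(lines_per_stream)
            )
            out.write(bz2.compress(block.encode("ascii")))


def count_lines(path):
    """Count lines as seen through fileinput with the hook_compressed opener."""
    total = 0
    with fileinput.input(files=[path], openhook=fileinput.hook_compressed) as stream:
        for _ in stream:
            total += 1
    return total


# Hypothetical sample file in a temporary directory.
sample = os.path.join(tempfile.mkdtemp(), "sample.xml.bz2")
write_multistream_sample(sample)

n = count_lines(sample)
print("lines read:", n)
```

On Python 3 this should report all six lines. A count that stops after the first stream's two lines, or an error, would suggest the interpreter's bz2 support is the limiting factor rather than the dump file itself.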