attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Is Windows 10 supported? #312

Closed nissansz closed 1 year ago

nissansz commented 1 year ago

Is Windows 10 supported?

rgryta commented 1 year ago

I have noticed that there's an issue with multiprocessing when using Windows. I've patched that up by switching from multiprocessing to multithreading. This makes it SIGNIFICANTLY slower when using CPUs with many cores (~25 times slower on my 3900X) but at least it works.
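To illustrate the trade-off: Windows only offers the "spawn" start method for multiprocessing, which is a common reason code written around fork behaves differently there. Threads behave the same on every platform, but the GIL serializes CPU-bound work. The following is a minimal sketch of that kind of swap, not the actual patch; `count_tokens` and `process_pages` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import multiprocessing


def count_tokens(page: str) -> int:
    """Hypothetical stand-in for the per-page extraction work."""
    return len(page.split())


def process_pages(pages, workers=4):
    """Run count_tokens over pages using threads instead of processes.

    Threads work the same way on every platform, but CPU-bound work is
    serialized by the GIL, so this scales worse than a process pool.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(count_tokens, pages))


if __name__ == "__main__":
    # Windows lists only 'spawn' here; Linux typically also offers 'fork'.
    print(multiprocessing.get_all_start_methods())
    print(process_pages(["one two three", "four five"]))
```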

I've added a Pull Request. In the meantime you can use my fork by adding: git+https://github.com/rgryta/wikiextractor.git@master to the requirements.txt instead of just wikiextractor.

nissansz commented 1 year ago

Thank you. I don't know how to add it. Could you just send the updated zip file here?

rgryta commented 1 year ago

Wikiextractor project zip file? You can get it from git: https://github.com/rgryta/wikiextractor/archive/refs/heads/master.zip

If you're using pip then I'd recommend using that though: pip install git+https://github.com/rgryta/wikiextractor.git@master

nissansz commented 1 year ago

wikiextractor-master>python setup.py
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: no commands supplied

python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
  File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

rgryta commented 1 year ago

Are you using python 3.6 or higher?

Seems like you're located under some weird path and sys.path is not set properly, causing import errors. Please ensure that python --version returns python version 3.6 (or better).

Then make sure pip is installed: python -m pip --version. If it reports an old version, update it with: python -m pip install --upgrade pip. If pip is not installed at all, download get-pip.py from https://bootstrap.pypa.io/get-pip.py and run: python get-pip.py. Once pip is in place, run: python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
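The two checks above can also be done from inside Python. This is a small sketch; `meets_minimum` and `pip_available` are hypothetical helper names, and the 3.6 minimum is simply the version mentioned above.

```python
import importlib.util
import sys


def meets_minimum(version_info, minimum=(3, 6)):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def pip_available():
    """Return True if this interpreter can locate the pip package."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("Python version OK:", meets_minimum(sys.version_info))
    print("pip importable:", pip_available())
```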

If this won't work then unfortunately I probably won't be able to help.

nissansz commented 1 year ago

python --version Python 3.8.8

python -m pip install -upgrade pip

Usage:
  D:\Python3.8.8\python.exe -m pip install [options] <requirement specifier> [package-index-options] ...
  D:\Python3.8.8\python.exe -m pip install [options] -r <requirements file> [package-index-options] ...
  D:\Python3.8.8\python.exe -m pip install [options] [-e] <vcs project url> ...
  D:\Python3.8.8\python.exe -m pip install [options] [-e] <local project path> ...
  D:\Python3.8.8\python.exe -m pip install [options] <archive url/path> ...

no such option: -u

L:\data\wikiextractor-master>python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
Collecting git+https://github.com/rgryta/wikiextractor.git@master
  Cloning https://github.com/rgryta/wikiextractor.git (to revision master) to c:\users\ni\appdata\local\temp\pip-req-build-z7_jhu8r
  error: subprocess-exited-with-error

  × git version did not run successfully.
  │ exit code: 1
  ╰─> [2 lines of output]
      'git' is not recognized as an internal or external command,
      operable program or batch file.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git version did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

L:\data\wikiextractor-master>

rgryta commented 1 year ago

It should be --upgrade, sorry. I copied the command from somewhere and the double hyphen got turned into something weird.

rgryta commented 1 year ago

From the looks of it you may also need to install the Git CLI. There's a prebuilt installer: https://git-scm.com/download/win. You can of course use a different distribution; that's just the first one I found on Google.
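A quick way to confirm whether git is visible to the same environment pip runs in is to look it up on PATH. A minimal sketch; `find_tool` is a hypothetical helper name.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    git_path = find_tool("git")
    if git_path is None:
        print("git was not found on PATH, so pip cannot clone git+https URLs")
    else:
        print("git found at", git_path)
```

If this reports that git is missing right after installing it, opening a new terminal usually helps, since PATH changes are only picked up by new sessions.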

nissansz commented 1 year ago

python -m pip install --upgrade pip
Requirement already satisfied: pip in d:\python3.8.8\lib\site-packages (23.1.2)

rgryta commented 1 year ago

In that case you're probably missing the git command. You can get it from the link above (git-scm.com). After that, just run: python -m pip install git+https://github.com/rgryta/wikiextractor.git@master

Unless the error you've gotten is about something else. Google Translator unfortunately struggled a bit with proper translation.

nissansz commented 1 year ago

image

rgryta commented 1 year ago

You can try another way.

Open the directory with extracted zip file that I sent at first. From the root directory of the project (so where the setup.py is located) execute: pip install . This SHOULD do the trick.

If it won't work then the pip install git+... is the way to go, but it looks like you have connection issues. Maybe VPN? See this thread for potential solutions: https://stackoverflow.com/questions/71571965/openssl-ssl-connect-connection-was-reset-in-connection-to-github-com443-while

nissansz commented 1 year ago

pip install setup.py
ERROR: Could not find a version that satisfies the requirement setup.py (from versions: none)
ERROR: No matching distribution found for setup.py

rgryta commented 1 year ago

No, not pip install setup.py. Just pip install . with a dot at the end.

nissansz commented 1 year ago

wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
  Preparing metadata (setup.py) ... done
ERROR: No .egg-info directory found in C:\Users\Ni\AppData\Local\Temp\pip-pip-egg-info-b84l6p21

This "No .egg-info directory found" error is a common one for me. I don't know what causes it; I have been seeing it often recently and don't know how to solve it.

rgryta commented 1 year ago

It probably needs a setuptools update. Execute pip install --upgrade setuptools and then retry the pip install . command.

nissansz commented 1 year ago

pip install --upgrade setuptools
Requirement already satisfied: setuptools in d:\python3.8.8\lib\site-packages (67.8.0)

Which packages does the command install? Maybe I can copy them to the site-packages folder directly.

rgryta commented 1 year ago

I read that uninstalling setuptools may also fix it: pip uninstall setuptools (though you should probably reinstall it after installing wikiextractor).

As for packages... wikiextractor has no dependencies, so unfortunately there aren't any to copy. There's something weird with your Python configuration. Another package that you may try to install/uninstall is wheel: pip install wheel. You may have to try different combinations of having those two packages installed/uninstalled (wheel and setuptools both installed, both uninstalled, and just one installed). Hard to say what's wrong.
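When trying those combinations, it helps to see which of the packages are actually installed for the interpreter in use. A minimal sketch using the standard library (Python 3.8 or newer); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("pip", "setuptools", "wheel", "wikiextractor"):
        print(name, installed_version(name) or "not installed")
```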

It's 1am here so I'll leave it at that for now. If you still have problems with it, you can write to me in a few hours. Though I think a few queries to Google/Stack Overflow should be enough to fix it.

Good luck!

nissansz commented 1 year ago

It installed, but it still fails when I run the code.

pip uninstall setuptools
Found existing installation: setuptools 67.8.0
Uninstalling setuptools-67.8.0:
  Would remove:
    d:\python3.8.8\lib\site-packages\_distutils_hack\*
    d:\python3.8.8\lib\site-packages\distutils-precedence.pth
    d:\python3.8.8\lib\site-packages\pkg_resources\*
    d:\python3.8.8\lib\site-packages\setuptools-67.8.0.dist-info\*
    d:\python3.8.8\lib\site-packages\setuptools\*
Proceed (Y/n)? y
  Successfully uninstalled setuptools-67.8.0

L:\data\wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: wikiextractor
  Building wheel for wikiextractor (pyproject.toml) ... done
  Created wheel for wikiextractor: filename=wikiextractor-3.0.7-py3-none-any.whl size=47887 sha256=89c3060b72af9867ae877b249a60d3ba7fa00f6b194c7b8c467561b07a30c948
  Stored in directory: c:\users\ni\appdata\local\pip\cache\wheels\ac\88\3b\0022eef871f6d21b6e24acdd2b6ca634c7b3fb274c1c5c6533
Successfully built wikiextractor
Installing collected packages: wikiextractor
  Attempting uninstall: wikiextractor
    Found existing installation: wikiextractor 3.0.6
    Uninstalling wikiextractor-3.0.6:
      Successfully uninstalled wikiextractor-3.0.

nissansz commented 1 year ago

python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
  File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package

nissansz commented 1 year ago

pip list
wikiextractor 3.0.7
wikipedia 1.4.0

C:\Users\Ni>python wikiextractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
python: can't open file 'wikiextractor': [Errno 2] No such file or directory

nissansz commented 1 year ago

System path

image

rgryta commented 1 year ago

Don't use it like that. You're providing the full path to the WikiExtractor file, which is a submodule of the wikiextractor package (so basically "wikiextractor.WikiExtractor"). Relative imports will be broken when you run it like this.

Use syntax provided in README.md: python -m wikiextractor.WikiExtractor <Wikipedia dump file>
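The difference between the two ways of launching can be reproduced with a throwaway package. This is only an illustrative sketch: `demo_pkg`, `helper` and `main` are made-up names, not part of wikiextractor.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Build a throwaway package and run its module by path and with -m."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helper.py").write_text("VALUE = 42\n")
        (pkg / "main.py").write_text(
            "from .helper import VALUE\nprint(VALUE)\n"
        )
        env = dict(os.environ, PYTHONPATH=tmp)

        # Run by file path: the file is loaded as a top-level script, so
        # Python has no parent package to resolve ".helper" against.
        by_path = subprocess.run(
            [sys.executable, str(pkg / "main.py")],
            capture_output=True, text=True, env=env,
        )
        # Run with -m: the module is imported as demo_pkg.main, so the
        # relative import resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            capture_output=True, text=True, env=env, cwd=tmp,
        )
    return by_path, as_module


if __name__ == "__main__":
    by_path, as_module = run_both_ways()
    print("by path:", by_path.returncode, by_path.stderr.strip().splitlines()[-1])
    print("with -m:", as_module.returncode, as_module.stdout.strip())
```

The first run fails with the same ImportError as in the traceback above, while the second succeeds.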

rgryta commented 1 year ago

As for why python wikiextractor doesn't work: the wikiextractor project is not launchable through __main__.py (for some reason, it just is this way). The syntax from the comment above should work.
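For context, python wikiextractor treats its argument as a file or directory in the current working directory, which is where the "can't open file" message comes from, while python -m package only works when the package ships a __main__ submodule. A small sketch for checking the latter; `has_main_module` is a hypothetical helper name.

```python
import importlib.util


def has_main_module(package):
    """Return True if `package` ships a __main__ submodule.

    That submodule is what `python -m package` executes.
    """
    try:
        return importlib.util.find_spec(package + ".__main__") is not None
    except (ImportError, ValueError):
        # Raised when the parent cannot be imported or is not a package
        return False


if __name__ == "__main__":
    for name in ("unittest", "wikiextractor"):
        print(name, "supports python -m:", has_main_module(name))
```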

nissansz commented 1 year ago

python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2

This seems to be working now; waiting for the result.

python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
INFO: Preprocessing 'L:\data\kowiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages

rgryta commented 1 year ago

I have not used --json option so I have no idea if it'll work. Fingers crossed that it does. Good luck. I gotta get some sleep.
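If it does work, the output should be readable with something like the sketch below. It assumes that --json writes one JSON object per line and that each object has fields such as title and text; I have not verified either, so check a real output file first. `read_json_lines` and the sample file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample standing in for one extracted output file.
    sample = Path(tempfile.mkdtemp()) / "wiki_00"
    sample.write_text(
        json.dumps({"id": "1", "title": "Example", "text": "Body"}) + "\n",
        encoding="utf-8",
    )
    for article in read_json_lines(sample):
        print(article["title"], len(article["text"]))
```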

nissansz commented 1 year ago

Thank you. Good night.

nissansz commented 1 year ago

I can now extract results the same way as on Linux.