Closed nissansz closed 1 year ago
I have noticed that there's an issue with multiprocessing when using Windows. I've patched that up by switching from multiprocessing to multithreading. This makes it SIGNIFICANTLY slower when using CPUs with many cores (~25 times slower on my 3900X) but at least it works.
I've added a Pull Request. In the meantime you can use my fork by adding: git+https://github.com/rgryta/wikiextractor.git@master to the requirements.txt instead of just wikiextractor.
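A minimal sketch of that kind of swap, for illustration only; this is not the actual change in the fork. The names clean_page and process_pages are hypothetical stand-ins for the per-page extraction work. It relies on ThreadPool offering the same map() interface as multiprocessing.Pool. In CPython, threads doing CPU-bound pure-Python work share one interpreter lock, which would be consistent with the slowdown mentioned above.

```python
from multiprocessing.pool import ThreadPool


def clean_page(text):
    # Hypothetical stand-in for the per-page extraction work.
    return " ".join(text.split())


def process_pages(pages, workers=4):
    # ThreadPool has the same map() interface as multiprocessing.Pool,
    # but its workers are threads inside the current process, so no
    # child process has to be spawned and nothing has to be pickled.
    with ThreadPool(workers) as pool:
        return pool.map(clean_page, pages)


if __name__ == "__main__":
    print(process_pages(["first   page", " second\npage "]))
```

The trade-off is the one described above: it avoids the process-spawning problems on Windows at the cost of parallel speed.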
Thank you. I don't know how to add that. Could you just send the updated zip file here?
Wikiextractor project zip file? You can get it from git: https://github.com/rgryta/wikiextractor/archive/refs/heads/master.zip
If you're using pip then I'd recommend using that though: pip install git+https://github.com/rgryta/wikiextractor.git@master
wikiextractor-master>python setup.py
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help
error: no commands supplied
python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in
Are you using python 3.6 or higher?
Seems like you're located under some weird path and sys.path is not set properly, causing import errors.
Please ensure that python --version returns python version 3.6 (or better).
Then ensure you have pip installed -> python -m pip --version. If it's an older version (let's say lower than 60.0.0), update pip with: python -m pip install --upgrade pip
If you don't have pip installed, then download get-pip.py from https://bootstrap.pypa.io/get-pip.py and execute: python get-pip.py
Once you have pip installed, execute: python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
If this doesn't work then unfortunately I probably won't be able to help.
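The same checks can be run from inside Python, which also confirms which interpreter is actually being used. This is only a rough sketch; check_environment is a hypothetical helper name, and the 3.6 minimum is taken from the comment above.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6)):
    # Compare the running interpreter against the minimum version and
    # check whether the pip package can be imported by this interpreter.
    return {
        "python_ok": sys.version_info >= min_version,
        "pip_installed": importlib.util.find_spec("pip") is not None,
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, "=", value)
```

If "executable" points at a different Python than expected, packages may be getting installed into one interpreter and run from another.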
python --version
Python 3.8.8
python -m pip install -upgrade pip
Usage:
D:\Python3.8.8\python.exe -m pip install [options]
no such option: -u
L:\data\wikiextractor-master>python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
Collecting git+https://github.com/rgryta/wikiextractor.git@master
Cloning https://github.com/rgryta/wikiextractor.git (to revision master) to c:\users\ni\appdata\local\temp\pip-req-build-z7_jhu8r
error: subprocess-exited-with-error
× git version did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
'git' is not recognized as an internal or external command,
operable program or batch file.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× git version did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
L:\data\wikiextractor-master>
Should be --upgrade, sorry - copied command from somewhere and it truncated double hyphen to something weird.
From the looks of it you may also need to install git cli. There's a compiled installer: https://git-scm.com/download/win You can of course use a different distribution, that's just the first one I found quickly on Google
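A quick way to confirm whether git is visible to programs like pip is to look it up on PATH. A small sketch, with find_git as a hypothetical helper name:

```python
import shutil


def find_git():
    # shutil.which searches the directories on PATH for an executable
    # and returns its full path, or None when nothing is found.
    return shutil.which("git")


if __name__ == "__main__":
    location = find_git()
    if location is None:
        print("git was not found on PATH")
    else:
        print("git found at", location)
```

After installing git, a new terminal window is usually needed before the updated PATH is picked up.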
python -m pip install --upgrade pip
Requirement already satisfied: pip in d:\python3.8.8\lib\site-packages (23.1.2)
In that case you're probably missing git command. You can get it from the link above (git-scm.com). After that just use python -m pip install git+https://github.com/rgryta/wikiextractor.git@master
Unless the error you've gotten is about something else. Google Translator unfortunately struggled a bit with proper translation.
You can try another way.
Open the directory with extracted zip file that I sent at first. From the root directory of the project (so where the setup.py is located) execute: pip install .
This SHOULD do the trick.
If it won't work then the pip install git+... is the way to go, but it looks like you have connection issues. Maybe VPN? See this thread for potential solutions: https://stackoverflow.com/questions/71571965/openssl-ssl-connect-connection-was-reset-in-connection-to-github-com443-while
pip install setup.py
ERROR: Could not find a version that satisfies the requirement setup.py (from versions: none)
ERROR: No matching distribution found for setup.py
No, not pip install setup.py
Just pip install .
with a dot at the end.
wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
Preparing metadata (setup.py) ... done
ERROR: No .egg-info directory found in C:\Users\Ni\AppData\Local\Temp\pip-pip-egg-info-b84l6p21
This "No .egg-info directory found" error is a common one. I don't know why it happens. I've been seeing it often recently and don't know how to solve it.
Probably needs setuptools update. Execute pip install --upgrade setuptools
And then retry the pip install .
pip install --upgrade setuptools
Requirement already satisfied: setuptools in d:\python3.8.8\lib\site-packages (67.8.0)
which packages are installed for the command? Maybe I can copy them to site-packages folder directly.
I read that uninstalling setuptools may also fix it: pip uninstall setuptools (obviously after installing wikiextractor you should probably reinstall it though)
As for packages... wikiextractor has no dependencies so there aren't any unfortunately. There's something weird with your python configuration. Another package that you may try to install/uninstall is wheel: pip install wheel. You may have to try different combinations of having those two packages installed/uninstalled (so wheel and setuptools both installed, both uninstalled, and just one installed). Hard to say what's wrong.
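To see what pip has actually registered for a package, including its declared dependencies, the installed metadata can be read directly. A sketch assuming Python 3.8 or newer (where importlib.metadata is available); describe_package is a hypothetical helper name:

```python
from importlib import metadata


def describe_package(name):
    # Look up the installed distribution; return None when pip has no
    # record of it in the current interpreter's site-packages.
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    # dist.requires is None when the package declares no dependencies.
    return {"version": dist.version, "requires": dist.requires or []}


if __name__ == "__main__":
    for package in ("wikiextractor", "setuptools", "wheel"):
        print(package, "->", describe_package(package))
```

An empty "requires" list would mean there is nothing extra to copy into site-packages by hand.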
It's 1am here so I'll be leaving it at that for now. If you still have some problem with it then you can write to me in a few hours. Though I think a few queries to Google/Stack Overflow should suffice to fix it.
Good luck!
installed, but still failed when using the code
pip uninstall setuptools
Found existing installation: setuptools 67.8.0
Uninstalling setuptools-67.8.0:
Would remove:
d:\python3.8.8\lib\site-packages\_distutils_hack\*
d:\python3.8.8\lib\site-packages\distutils-precedence.pth
d:\python3.8.8\lib\site-packages\pkg_resources\*
d:\python3.8.8\lib\site-packages\setuptools-67.8.0.dist-info\*
d:\python3.8.8\lib\site-packages\setuptools\*
Proceed (Y/n)? y
Successfully uninstalled setuptools-67.8.0
L:\data\wikiextractor-master>pip install .
Processing l:\data\wikiextractor-master
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: wikiextractor
Building wheel for wikiextractor (pyproject.toml) ... done
Created wheel for wikiextractor: filename=wikiextractor-3.0.7-py3-none-any.whl size=47887 sha256=89c3060b72af9867ae877b249a60d3ba7fa00f6b194c7b8c467561b07a30c948
Stored in directory: c:\users\ni\appdata\local\pip\cache\wheels\ac\88\3b\0022eef871f6d21b6e24acdd2b6ca634c7b3fb274c1c5c6533
Successfully built wikiextractor
Installing collected packages: wikiextractor
Attempting uninstall: wikiextractor
Found existing installation: wikiextractor 3.0.6
Uninstalling wikiextractor-3.0.6:
Successfully uninstalled wikiextractor-3.0.
python L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
Traceback (most recent call last):
File "L:\data\wikiextractor-master\wikiextractor\WikiExtractor.py", line 67, in
pip list
wikiextractor 3.0.7
wikipedia 1.4.0
C:\Users\Ni>python wikiextractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
python: can't open file 'wikiextractor': [Errno 2] No such file or directory
System path
Don't use it like that. You're providing the full path to the WikiExtractor file, which is a submodule of wikiextractor (so basically "wikiextractor.WikiExtractor"). Relative imports will be broken when you use it like this.
Use syntax provided in README.md: python -m wikiextractor.WikiExtractor <Wikipedia dump file>
As for why python wikiextractor doesn't work -> wikiextractor project is not launchable through __main__.py (for some reason, it just is this way). Syntax from comment above should work.
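If the extractor needs to be launched from another script, the same module-style invocation can be built with subprocess. This is a sketch only; module_command and run_module are hypothetical helper names, and the paths in the usage part are the ones from this thread.

```python
import subprocess
import sys


def module_command(module, args):
    # Equivalent of `python -m <module> <args...>`, using the same
    # interpreter that is running this script.
    return [sys.executable, "-m", module, *args]


def run_module(module, args):
    # Run the module as a child process and hand back its exit code.
    return subprocess.run(module_command(module, args), check=False).returncode


if __name__ == "__main__":
    exit_code = run_module(
        "wikiextractor.WikiExtractor",
        ["-b", "50M", "-o", "L:/data/testko", "--json",
         r"L:\data\kowiki-latest-pages-articles-multistream.xml.bz2"],
    )
    print("exit code:", exit_code)
```

Running by module name lets Python resolve the package from site-packages, so the relative imports inside wikiextractor keep working.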
python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
This seems to be working now, waiting for the result.
python -m wikiextractor.WikiExtractor -b 50M -o L:/data/testko --json L:\data\kowiki-latest-pages-articles-multistream.xml.bz2
INFO: Preprocessing 'L:\data\kowiki-latest-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
I have not used --json option so I have no idea if it'll work. Fingers crossed that it does. Good luck. I gotta get some sleep.
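Once the run finishes, the --json output can be sanity-checked with a few lines of Python. This sketch rests on assumptions not confirmed in this thread: that the output directory contains files named like wiki_00 in subfolders, and that each non-empty line is one JSON object with a "title" field. iter_articles is a hypothetical helper name.

```python
import json
from pathlib import Path


def iter_articles(output_dir):
    # Assumed layout: <output_dir>/<subfolder>/wiki_NN, one JSON object
    # per line. Adjust the pattern if the real output looks different.
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    for count, article in enumerate(iter_articles("L:/data/testko")):
        print(article.get("title"))
        if count >= 4:
            break
```

If json.loads raises an error on the first line, the files are probably in the default non-JSON format instead.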
Thank you. Good night.
I can now extract results the same way as on Linux.
Is Windows 10 supported?