WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

dumpgenerator.py --xmlrevisions creates Error: list index out of range on pokewiki.de #430

Closed: GERZAC1002 closed this issue 2 years ago

GERZAC1002 commented 2 years ago

Full command that was used:

./dumpgenerator.py --xmlrevisions --images --xml --curonly https://pokewiki.de --namespace 0

I had used the command without '--namespace 0' before with the same result; I only added it to reproduce the error while not putting too much stress on the wiki itself.

Expected behaviour:

creating a dump of https://pokewiki.de

Actual behaviour after a few minutes:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2569, in <module>
    main()
  File "./dumpgenerator.py", line 2561, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 2128, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 741, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "./dumpgenerator.py", line 877, in getXMLRevisions
    print "        %d more revisions listed, until %s" % (len(revids), revids[-1])
IndexError: list index out of range
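
For context, the crash is revids[-1] being evaluated on an empty list. A minimal sketch of a guard that would avoid it (Python 2 to match dumpgenerator.py; revids and the print come straight from the traceback, the surrounding loop is assumed):

# Hypothetical guard around the failing print at dumpgenerator.py:877;
# the IndexError suggests revids can legitimately be empty on the last batch.
if revids:
    print "        %d more revisions listed, until %s" % (len(revids), revids[-1])
else:
    print "        No more revisions listed, nothing left to fetch"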

Full log: dumgenerator.py_xmlrevisions.log

Tail of the output file:

{{Karte Designs/Zeile|typ=Farblos|Damythir-V (Time Gazer 059)|illus=aky CG Works|seltenheit=RR|num=1}}
{{Karte Designs/Zeile|typ=Farblos|Damythir-V (Time Gazer 076)|illus=aky CG Works|seltenheit=SR|num=2}}
&lt;/div&gt;

[[en:Wyrdeer V (Time Gazer 59)]]
[[ja:&#12450;&#12516;&#12471;&#12471;V (S10D)]]</text>
      <sha1>ip8lev6wdaqnyxpyw926h46ktlmtoup</sha1>
    </revision>
  </page> 

Quick 'integrity' check on the output file

 grep "<title>" -c *-current.xml ; grep "<page" -c *-current.xml ; grep "</page>" -c *-20220412-current.xml 
2231
2231
2231

Number of page titles inside *-titles.txt: 86796
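
The gap between the two counts can be cross-checked in the same Python 2 style as the tool (file names are inferred from the output above; assuming one <title> per <page> in the dump):

# Compare pages written to the dump against the titles list; the large
# gap (2231 vs 86796) shows the dump aborted long before finishing.
pages = sum(line.count('<title>') for line in open('pokewikide-20220412-current.xml'))
titles = sum(1 for line in open('pokewikide-20220412-titles.txt') if line.strip())
print "pages in dump: %d" % pages
print "titles listed: %d" % titles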

Test without '--xmlrevisions'

./dumpgenerator.py --images --xml --curonly https://pokewiki.de --namespace 0
Checking API... https://www.pokewiki.de/api.php
API is OK: https://www.pokewiki.de/api.php
Checking index.php... https://www.pokewiki.de/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://www.pokewiki.de/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 0
Excluding titles from namespaces = None
1 namespaces found
    Retrieving titles in the namespace 0
    86795 titles retrieved in the namespace 0
Titles saved at... pokewikide-20220412-titles.txt
86795 page titles loaded
https://www.pokewiki.de/api.php
HTTP Error 404.
Not found. Is Special:Export enabled for this wiki?
https://www.pokewiki.de/index.php?action=submit&curonly=1&limit=1&pages=Main_Page&title=Special%3AExport
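
The failing request can be reproduced in isolation (a plain requests GET as an approximation of what dumpgenerator.py sends; the parameters are copied from the URL in the log):

# Probe Special:Export directly; this wiki answers HTTP 404 here, which
# is what produces the "Is Special:Export enabled" message above.
import requests

r = requests.get('https://www.pokewiki.de/index.php',
                 params={'title': 'Special:Export', 'action': 'submit',
                         'pages': 'Main_Page', 'curonly': 1, 'limit': 1})
print r.status_code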

After taking pull request #280 from back in 2016 and integrating it into a new version (pull request #429), I managed to get a full dump of the mentioned wiki.

nemobis commented 2 years ago

Unfortunately this wiki intentionally returns HTTP 403 for api.php in many cases. So it's an arms race: if we implement a workaround, they will just block different user agents or whatever. I suggest contacting the sysadmins so that they create regular dumps themselves and make them available on the Internet Archive; then people won't be tempted to export manually so often.

I don't recommend using your other method with Special:Export because it will increase their load and therefore invite more blocks.

GERZAC1002 commented 2 years ago

Oh okay, but that's understandable considering that the whole dump ended up at over 30 GB. I had actually considered asking them for a dump before I found the alternative. The alternative to using this tool would have been mirroring the whole site with HTTrack, which would have had much bigger overhead: the last time I tried that on a wiki, it attempted to download the complete history of every page and had no option to easily exclude namespaces.

Any recommendations on how to put it on the Internet Archive, given that it is huge with all the images? (Even compressed, the folder would exceed the maximum file size of a FAT32-formatted drive, which is sadly still a common standard, so I don't know how viable that is.)

After that question is answered, I guess this issue can be closed, since it seems the features were intentionally disabled by the administrators of the wiki.

EDIT: I found https://archive.org/download/wiki-pokewikide, so is there a way to add the dump that I already have (after I compress it)?

nemobis commented 2 years ago

On 12/04/22 at 20:24, Gernot Zacharias wrote:

Any recommendations on how to put it on the Internet Archive, given that it is huge with all the images?

If the wiki admins made the dump, it would be on their server, so the upload to the Internet Archive would probably be quite fast.

(Even compressed, the folder would exceed the maximum file size of a FAT32-formatted drive, which is sadly still a common standard, so I don't know how viable that is.)

You can start with the history 7z that launcher.py would produce; it's going to be much smaller. It's OK to upload a 30 GB file to the Internet Archive. If you have a FAT HDD, you can create 4 GB volumes.
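
For the volume idea, a sketch using 7-Zip's volume switch (-v4000m keeps each part safely under FAT32's 4 GiB-minus-one-byte file size limit; the file names are placeholders):

# Split the dump into ~4 GB .7z volumes so the parts fit on FAT32 media.
import subprocess

subprocess.check_call(['7z', 'a', '-v4000m',
                       'pokewikide-20220412-history.7z',
                       'pokewikide-20220412-wikidump/'])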

If your connection is not sufficiently reliable or fast to finish a 30 GB upload, you can create a torrent containing the file and upload the torrent file instead; the Internet Archive will then download it from your torrent client.
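
If you do end up uploading the dump yourself, a sketch with the internetarchive Python library (pip install internetarchive; the identifier and metadata here are illustrative, not the settings of the existing wiki-pokewikide item):

# Upload the compressed dump to an Internet Archive item; run
# "ia configure" first so the account's S3 keys are stored locally.
from internetarchive import upload

upload('wiki-pokewikide',
       files=['pokewikide-20220412-history.7z'],
       metadata={'mediatype': 'web', 'subject': 'wiki; wikiteam'})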