Open nemobis opened 1 year ago
I tried another wiki (distrowiki.mirahrze.org) and nothing went wrong. Maybe it's an occasional error, an issue related to the Python version, or something else.
I think I got an HTTP 429 error, but we catch it and just proceed as if nothing happened:
    while True:
        try:
            arvrequest = site.api(http_method=config['http_method'], **arvparams)
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 405 and config['http_method'] == "POST":
                print("POST request to the API failed, retrying with GET")
                config['http_method'] = "GET"
                continue
We should ideally implement a retry mechanism as we have in getXMLPage(), to avoid endless loops.
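A bounded retry along those lines could look roughly like the sketch below. It is not the code from getXMLPage(): call_with_retries, retry_on, max_retries, base_delay and sleep are hypothetical names, and the linear 20/40/60 second wait is only borrowed from the "Waiting 20 seconds and reloading..." messages in the logs in this thread.

```python
import time


def call_with_retries(call, config, retry_on, max_retries=5, base_delay=20,
                      sleep=time.sleep):
    """Run call(http_method=...) until it succeeds or retries run out.

    Hypothetical helper: `call` wraps the API request and `retry_on` is the
    exception class (or tuple of classes) that should trigger a retry.
    """
    attempt = 0
    while True:
        try:
            return call(http_method=config['http_method'])
        except retry_on as e:
            response = getattr(e, 'response', None)
            status = getattr(response, 'status_code', None)
            if status == 405 and config['http_method'] == "POST":
                # Same fallback as the existing loop. It can only happen
                # once, because the method is no longer POST afterwards.
                print("POST request to the API failed, retrying with GET")
                config['http_method'] = "GET"
                continue
            attempt += 1
            if attempt >= max_retries:
                # Give up and surface the error instead of carrying on
                # without a result.
                raise
            delay = base_delay * attempt  # 20, 40, 60, ... seconds
            print("HTTP error %s in attempt %d. Waiting %d seconds and retrying..."
                  % (status, attempt, delay))
            sleep(delay)
```

Under these assumptions the existing request would be wrapped as call_with_retries(lambda http_method: site.api(http_method=http_method, **arvparams), config, requests.exceptions.HTTPError). Any error other than the 405 fallback is either retried or re-raised, so the result is never silently missing.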
Image filenames and URLs saved at... denbagovmirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./denbagovmirahezeorg_w-20230618-wikidump-2/images" directory
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\u0420'
tail: cannot open 'denbagovmirahezeorg_w-20230618-wikidump/denbagovmirahezeorg_w-20230618-history.xml' for reading: No such file or directory
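The KeyError in the traceback above is what Python 2's urllib.quote() raises when it is given a unicode string with non-ASCII characters such as u'\u0420'. A minimal sketch of a workaround, assuming the title should be UTF-8 encoded before quoting; quote_title is a hypothetical helper, not a function that exists in dumpgenerator.py:

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def quote_title(title):
    """Percent-encode a page title for use in a URL query string.

    Text (unicode) titles are encoded to UTF-8 bytes first, so quote()
    never sees a code point above 255.
    """
    if not isinstance(title, bytes):
        title = title.encode("utf-8")
    return quote(title)
```

With this, the call in generateImageDump would pass quote_title(title) instead of urllib.quote(title).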
I don't understand the HTTP 502 errors
Analysing https://ubrwiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
...................................... Found 1851 images
1851 image names loaded
Image filenames and URLs saved at... ubrwikimirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./ubrwikimirahezeorg_w-20230618-wikidump/images" directory
Downloaded 10 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:1,00_M$.png" is wrong. Waiting 20 seconds and reloading...
Downloaded 20 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:1900.png" is wrong. Waiting 20 seconds and reloading...
Downloaded 30 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:2_turno.png" is wrong. Waiting 20 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 2, XML for "Image:2_turno.png" is wrong. Waiting 40 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 3, XML for "Image:2_turno.png" is wrong. Waiting 60 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 4, XML for "Image:2_turno.png" is wrong. Waiting 80 seconds and reloading...
HTTP Error 502.
Server error, max retries exceeded.
Please resume the dump later.
https://ubrwiki.miraheze.org/w/index.php?action=submit&curonly=1&limit=1&pages=Image%3A20M%24.png&title=Special%3AExport
ouch
Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... avidwiki_w-20230620-history.xml
Retrieving image filenames
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.
https://www.avid.wiki/w/api.php?aiprop=url%7Cuser&format=json&aifrom=WBRZ_2013.png&list=allimages&ailimit=50&action=query
Changed directory to /mnt/at/wikiteam/avidwiki_w-20230620-wikidump
606332
606332
606332
https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
Not reproduced in the latest MW-Scraper.
Trying to export all revisions from namespace -1 (magic number refers to "all")
Trying to get wikitext from the allrevisions API and to build the XML
틀:동음이의, 30 edits (--xmlrevisions)
틀:반대, 1 edits (--xmlrevisions)
틀:찬성, 1 edits (--xmlrevisions)
틀:의견, 4 edits (--xmlrevisions)
틀:삭제, 4 edits (--xmlrevisions)
틀:유지, 2 edits (--xmlrevisions)
틀:이동, 1 edits (--xmlrevisions)
틀:넘겨주기, 1 edits (--xmlrevisions)
틀:중립, 1 edits (--xmlrevisions)
틀:병합, 1 edits (--xmlrevisions)
틀:질문, 1 edits (--xmlrevisions)
틀:분할, 1 edits (--xmlrevisions)
......
Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
Maybe it was just an occasional error.
If the condition e.response.status_code == 405 and config['http_method'] == "POST" is False, the continue is skipped and arvrequest is left unbound.