WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest ones. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Assorted dumpgenerator.py failures with some Miraheze (MediaWiki 1.39.3) wikis #467

Open · nemobis opened this issue 1 year ago

nemobis commented 1 year ago
Titles saved at... bigforestmirahezeorg_w-20230617-titles.txt
18795 page titles loaded
https://bigforest.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
42 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2135, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 742, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "dumpgenerator.py", line 843, in getXMLRevisions
    for page in arvrequest['query']['allrevisions']:
UnboundLocalError: local variable 'arvrequest' referenced before assignment
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84

Maybe it was just an occasional error.

GT-610 commented 1 year ago

I tried another wiki (distrowiki.miraheze.org) and nothing went wrong. Maybe it's an occasional error, an issue related to the Python version, or something else.

nemobis commented 1 year ago

I think I got an HTTP 429 error, but we catch it and just proceed as if nothing happened:

                while True:
                    try:
                        arvrequest = site.api(http_method=config['http_method'], **arvparams)
                    except requests.exceptions.HTTPError as e:
                        if e.response.status_code == 405 and config['http_method'] == "POST":
                            print("POST request to the API failed, retrying with GET")
                            config['http_method'] = "GET"
                            continue
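                        # any other status (e.g. HTTP 429) just falls out of
                        # the except block: no retry, no re-raise, and the loop
                        # body then reads arvrequest while it is still unbound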

We should ideally implement a retry mechanism as we have in getXMLPage(), to avoid endless loops.
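
For illustration, a sketch of what that retry could look like, reusing site, config and arvparams from the snippet above (not a tested patch; maxretries is a hypothetical cap in the spirit of getXMLPage()):

    import time
    import requests

    maxretries = 5
    retries = 0
    while True:
        try:
            arvrequest = site.api(http_method=config['http_method'], **arvparams)
            break  # success, leave the retry loop
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 405 and config['http_method'] == "POST":
                print("POST request to the API failed, retrying with GET")
                config['http_method'] = "GET"
            elif retries < maxretries:
                retries += 1
                wait = 20 * retries  # back off like getXMLPage(): 20, 40, 60...
                print("HTTP %d from the API, waiting %d seconds and retrying"
                      % (e.response.status_code, wait))
                time.sleep(wait)
            else:
                raise  # give up loudly instead of reading arvrequest unbound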

nemobis commented 1 year ago
Image filenames and URLs saved at... denbagovmirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./denbagovmirahezeorg_w-20230618-wikidump-2/images" directory
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\u0420'
tail: cannot open 'denbagovmirahezeorg_w-20230618-wikidump/denbagovmirahezeorg_w-20230618-history.xml' for reading: No such file or directory
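
That KeyError is Python 2's urllib.quote() choking on a unicode title (u'\u0420' is the Cyrillic letter Р): the py2 quoter table only covers byte values. A minimal sketch of the usual fix, percent-encoding the UTF-8 bytes instead (the title below is hypothetical):

    import urllib

    title = u'Image:\u0420\u0443\u0441\u044c.png'  # hypothetical Cyrillic title
    # Python 2's urllib.quote() raises KeyError on code points above
    # Latin-1; encode to UTF-8 bytes first, then percent-encode
    quoted = urllib.quote(title.encode('utf-8'))
    print(quoted)  # Image%3A%D0%A0%D1%83%D1%81%D1%8C.png
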
nemobis commented 1 year ago

I don't understand the HTTP 502 errors:

Analysing https://ubrwiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
......................................    Found 1851 images
1851 image names loaded
Image filenames and URLs saved at... ubrwikimirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./ubrwikimirahezeorg_w-20230618-wikidump/images" directory
    Downloaded 10 images
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 1, XML for "Image:1,00_M$.png" is wrong. Waiting 20 seconds and reloading...
    Downloaded 20 images
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 1, XML for "Image:1900.png" is wrong. Waiting 20 seconds and reloading...
    Downloaded 30 images
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 1, XML for "Image:2_turno.png" is wrong. Waiting 20 seconds and reloading...
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 2, XML for "Image:2_turno.png" is wrong. Waiting 40 seconds and reloading...
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 3, XML for "Image:2_turno.png" is wrong. Waiting 60 seconds and reloading...
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 4, XML for "Image:2_turno.png" is wrong. Waiting 80 seconds and reloading...
HTTP Error 502.
Server error, max retries exceeded.
Please resume the dump later.
https://ubrwiki.miraheze.org/w/index.php?action=submit&curonly=1&limit=1&pages=Image%3A20M%24.png&title=Special%3AExport
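
The repeated read timeouts followed by a 502 look like throttling or an overloaded backend. For what it's worth, a sketch (illustrative values, not current dumpgenerator.py behaviour) of letting requests/urllib3 retry transient statuses with backoff and a longer read timeout:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # urllib3 sleeps backoff_factor * 2**(attempt - 1) seconds between
    # tries and honours a Retry-After header on 429 by default
    retries = Retry(total=5, backoff_factor=2,
                    status_forcelist=[429, 502, 503])
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retries))

    # a longer read timeout than the current 10 seconds gives a slow
    # MediaWiki backend more room before we count a failure
    r = session.get('https://ubrwiki.miraheze.org/w/api.php',
                    params={'action': 'query', 'meta': 'siteinfo',
                            'format': 'json'},
                    timeout=(10, 60))  # (connect, read)
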
nemobis commented 1 year ago

ouch

Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... avidwiki_w-20230620-history.xml
Retrieving image filenames
......................................................................HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.
https://www.avid.wiki/w/api.php?aiprop=url%7Cuser&format=json&aifrom=WBRZ_2013.png&list=allimages&ailimit=50&action=query
Changed directory to /mnt/at/wikiteam/avidwiki_w-20230620-wikidump
606332
606332
606332
yzqzss commented 1 year ago

> https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84

Not reproduced in the latest MW-Scraper.

Trying to export all revisions from namespace -1 (magic number refers to "all")
Trying to get wikitext from the allrevisions API and to build the XML
틀:동음이의, 30 edits (--xmlrevisions)
틀:반대, 1 edits (--xmlrevisions)
틀:찬성, 1 edits (--xmlrevisions)
틀:의견, 4 edits (--xmlrevisions)
틀:삭제, 4 edits (--xmlrevisions)
틀:유지, 2 edits (--xmlrevisions)
틀:이동, 1 edits (--xmlrevisions)
틀:넘겨주기, 1 edits (--xmlrevisions)
틀:중립, 1 edits (--xmlrevisions)
틀:병합, 1 edits (--xmlrevisions)
틀:질문, 1 edits (--xmlrevisions)
틀:분할, 1 edits (--xmlrevisions)
......

yzqzss commented 1 year ago

> Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
>
> Maybe it was just an occasional error.

If e.response.status_code == 405 and config['http_method'] == "POST" evaluates to False, the continue is never reached, the except block falls through, and arvrequest is read while still unbound.

https://github.com/wikiteam/wikiteam/blob/b9f861d8c206bd39f1293f1c16c008a5c141b47b/dumpgenerator.py#L829
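
A minimal sketch of a guard that would at least surface the problem (the retry sketched earlier in the thread would still be nicer): re-raise whatever the handler does not explicitly recover from, so arvrequest can never be read while unbound:

                    except requests.exceptions.HTTPError as e:
                        if e.response.status_code == 405 and config['http_method'] == "POST":
                            print("POST request to the API failed, retrying with GET")
                            config['http_method'] = "GET"
                            continue
                        # anything else (429, 5xx, ...) must not fall through
                        raise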