fsrrt / wikiteam

Automatically exported from code.google.com/p/wikiteam

Large histories memory error #8

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
RationalWiki:SPOV 18 edits
RationalWiki:Saloon Bar 1 edits
RationalWiki:Saloon Bar/Drink counter/Archive 1 2 edits
RationalWiki:Saloon bar 2000 edits
RationalWiki:Saloon bar 3000 edits
RationalWiki:Saloon bar 4000 edits
RationalWiki:Saloon bar 5000 edits
RationalWiki:Saloon bar 6000 edits
RationalWiki:Saloon bar 7000 edits
RationalWiki:Saloon bar 8000 edits
RationalWiki:Saloon bar 9000 edits
RationalWiki:Saloon bar 10000 edits
RationalWiki:Saloon bar 11000 edits
RationalWiki:Saloon bar 12000 edits
Traceback (most recent call last):
  File "dumpgenerator.py", line 878, in <module>
    f.close()
  File "dumpgenerator.py", line 785, in main
    xmltitles = re.findall(r'<title>([^<]+)</title>', l) #weird if found more than 1, but maybe
  File "dumpgenerator.py", line 335, in generateXMLDump
    if c % 10 == 0:
  File "dumpgenerator.py", line 279, in getXMLPage
    xml = xml.split('</page>')[0]+xml2.split('<page>\n')[1]
MemoryError

Original issue reported on code.google.com by emi...@gmail.com on 16 Apr 2011 at 9:35
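The frame at getXMLPage line 279 splices each newly fetched Special:Export chunk onto one ever-growing string, so a page's whole history has to fit in memory at once. Below is a minimal sketch of a bounded-memory alternative, not the script's actual code: fetch_chunk() is a hypothetical stand-in for getXMLPageCore(), it is assumed to return one <page> element with up to `limit` revisions via Special:Export's offset/limit paging, and each chunk's revisions are written straight to the dump file as they arrive.

    import re

    def dump_page(out, title, fetch_chunk, limit=1000):
        # fetch_chunk(title, offset, limit) is hypothetical: one <page>
        # element holding up to `limit` revisions starting after `offset`.
        offset = '1'          # assumed to mean "from the first revision"
        wrote_header = False
        while True:
            xml = fetch_chunk(title, offset, limit)
            revisions = xml.split('<revision>')[1:]
            if not revisions:
                break
            if not wrote_header:
                out.write(xml.split('<revision>')[0])  # page header, once
                wrote_header = True
            for rev in revisions:
                # Re-emit each revision on its own; nothing accumulates
                # beyond the current chunk.
                out.write('<revision>' + rev.split('</revision>')[0]
                          + '</revision>\n')
            if len(revisions) < limit:
                break                                  # short chunk: done
            # Resume after the newest timestamp seen in this chunk.
            offset = re.findall(r'<timestamp>([^<]+)</timestamp>', xml)[-1]
        if wrote_header:
            out.write('  </page>\n')

Peak memory then stays around the size of one chunk instead of the full multi-GiB history.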

GoogleCodeExporter commented 8 years ago
Not sure if this is a bug and if it's the same bug, but anyway: while trying to download http://it.wikihow.com/index.php?title=Discussioni_template:temp&action=history :

    XML for "Discussioni_template:temp" is wrong. Waiting 20 seconds and reloading...
^CTraceback (most recent call last):
  File "../../dumpgenerator.py", line 941, in <module>
    main()
  File "../../dumpgenerator.py", line 906, in main
    generateXMLDump(config=config, titles=titles)
  File "../../dumpgenerator.py", line 383, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "../../dumpgenerator.py", line 292, in getXMLPage
    xml = getXMLPageCore(headers=headers, params=params, config=config)
  File "../../dumpgenerator.py", line 268, in getXMLPageCore
    xml = f.read()
  File "/usr/lib/python2.7/socket.py", line 359, in read
    return buf.getvalue()

The script was downloading at full bandwidth (1+ MiB/s) and reached almost 1 
GiB of memory consumption after that "Waiting 20 seconds and reloading". That 
page history is a monster, gigabytes of spam, but it's probably not sane to 
store all that data in RAM the way the script seems to do.

Original comment by nemow...@gmail.com on 8 Jul 2011 at 11:43
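The last frame above is the whole-response read (xml = f.read()) that buffers the entire export in RAM before anything is written out. A minimal sketch, not the script's actual code, of streaming the response to disk in fixed-size chunks instead; urllib2 matches the Python 2 script, and the function name is made up:

    import shutil
    import urllib2  # Python 2, as in the script; urllib.request on Python 3

    def fetch_to_file(url, path, chunk_size=64 * 1024):
        # Copy the response to disk chunk_size bytes at a time, so even a
        # multi-GiB page history never has to fit in memory at once.
        response = urllib2.urlopen(url)
        try:
            with open(path, 'wb') as out:
                shutil.copyfileobj(response, out, chunk_size)
        finally:
            response.close()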

GoogleCodeExporter commented 8 years ago
Another example, similar to the first one but a bit different, because 
apparently the download of the page didn't even start (there's no reported 
revision download progress in chunks of 1000): http://p.defau.lt/?dZddltkd5YcV5zYjMcWvXA
It seems to be caused by the horribly huge history of the page right after 
the last downloaded one: 
http://wiki.guildwars.com/index.php?title=ArenaNet:Guild_Wars_2_suggestions/Scratchpad&action=history (7829 revisions, about 1900 MiB).

Original comment by nemow...@gmail.com on 13 Jul 2011 at 12:39

GoogleCodeExporter commented 8 years ago
As a workaround, I edited the titles list and moved the problematic titles to 
the end, to postpone the download of those histories and watch it more 
carefully. In one case, I got the MemoryError even though Python was using 
less than 1 GiB of RAM and almost 2 additional GiB were still free; in another 
case, the page history is 1.7 GiB when downloaded with Special:Export in the 
browser, and I don't know how much of it the script managed to download. 
(Looking around a bit, it seems it may be normal to get a MemoryError at about 
1 GiB of memory whatever the amount of free memory, presumably because a 
32-bit Python process can run out of contiguous address space for one huge 
string long before physical memory is exhausted.)
They're all like this: http://p.defau.lt/?3JuOkvmlwDGqi_1A30V6qQ .

Original comment by nemow...@gmail.com on 17 Jul 2011 at 7:51
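For what it's worth, the reordering workaround described above can be scripted. A hedged sketch: the titles filename is an assumption, and the only title listed is the Guild Wars scratchpad page mentioned earlier.

    # Push known-problematic titles to the end of the titles list so the
    # bulk of the dump completes before the huge histories are attempted.
    problematic = set([
        'ArenaNet:Guild_Wars_2_suggestions/Scratchpad',
    ])

    with open('guildwarswiki-titles.txt') as f:  # filename is an assumption
        titles = [line.rstrip('\n') for line in f]

    # Stable sort: False sorts before True, so huge histories go last.
    titles.sort(key=lambda t: t in problematic)

    with open('guildwarswiki-titles.txt', 'w') as f:
        f.write('\n'.join(titles) + '\n')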

GoogleCodeExporter commented 8 years ago
Again urbandead:

Traceback (most recent call last):
  File "dumpgenerator.py", line 1205, in <module>
    main()
  File "dumpgenerator.py", line 1196, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1056, in resumePreviousDump
    generateXMLDump(config=config, titles=titles, start=lastxmltitle)
  File "dumpgenerator.py", line 457, in generateXMLDump
    xml = getXMLPage(config=config, title=title)
  File "dumpgenerator.py", line 389, in getXMLPage
    xml = xml.split('</page>')[0] + '    <revision>' + ('<revision>'.join(xml2.split('<revision>')[1:]))
MemoryError

Original comment by nemow...@gmail.com on 10 Nov 2013 at 9:23

GoogleCodeExporter commented 8 years ago
Other examples are http://dota2.gamepedia.com/ (Template:Dictionary/defindex, 
13k revisions) and http://wowpedia.org/ (Patch mirrors, 7k revisions).

Original comment by nemow...@gmail.com on 16 Feb 2014 at 2:01