Not sure if this is a bug, and whether it's the same bug, but anyway: it happened while trying to download
http://it.wikihow.com/index.php?title=Discussioni_template:temp&action=history :
XML for "Discussioni_template:temp" is wrong. Waiting 20 seconds and reloading...
^CTraceback (most recent call last):
File "../../dumpgenerator.py", line 941, in <module>
main()
File "../../dumpgenerator.py", line 906, in main
generateXMLDump(config=config, titles=titles)
File "../../dumpgenerator.py", line 383, in generateXMLDump
xml = getXMLPage(config=config, title=title)
File "../../dumpgenerator.py", line 292, in getXMLPage
xml = getXMLPageCore(headers=headers, params=params, config=config)
File "../../dumpgenerator.py", line 268, in getXMLPageCore
xml = f.read()
File "/usr/lib/python2.7/socket.py", line 359, in read
return buf.getvalue()
The script was downloading at full bandwidth (1+ MiB/s) and had reached almost 1 GiB of memory consumption after that "Waiting 20 seconds and reloading". That page history is a monster, full of gibibytes of spam, but it's probably not sane to store all of the data in RAM, as the script seems to do.
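For what it's worth, the traceback shows the whole Special:Export response being read into memory with a single f.read(). A minimal sketch of the alternative, streaming the response to disk in bounded chunks; this assumes Python 2 with urllib2 (as the traceback suggests), and download_xml_to_file is a hypothetical name, not a function in dumpgenerator.py:

```python
import urllib2

# Hypothetical sketch, not the actual dumpgenerator.py code: stream the
# Special:Export response to disk in fixed-size chunks instead of holding
# the whole history in RAM with a single f.read().
def download_xml_to_file(url, data, out_path, chunk_size=64 * 1024):
    f = urllib2.urlopen(url, data)
    with open(out_path, 'wb') as out:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes per read
            if not chunk:  # empty string means the response is exhausted
                break
            out.write(chunk)  # peak memory stays around chunk_size
```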
Original comment by nemow...@gmail.com on 8 Jul 2011 at 11:43
Another example, similar to the first one but a bit different, because apparently the download of the page didn't even start (no revision-download progress in chunks of 1000 is reported): http://p.defau.lt/?dZddltkd5YcV5zYjMcWvXA
It seems to be caused by the horribly huge history of this page, the next one after the last downloaded title:
http://wiki.guildwars.com/index.php?title=ArenaNet:Guild_Wars_2_suggestions/Scratchpad&action=history (7829 revisions, about 1900 MiB).
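For context, the "chunks of 1000" above refers to the way a long history gets fetched through Special:Export with offset/limit parameters, roughly like the following hypothetical sketch; fetch_history_chunk is an illustrative name, and the exact parameters the script sends are an assumption on my part:

```python
import urllib
import urllib2

# Hypothetical sketch of fetching one chunk of a page history through
# Special:Export; parameter details are assumptions, not dumpgenerator.py code.
def fetch_history_chunk(index_url, title, offset, limit=1000):
    params = {
        'title': 'Special:Export',
        'pages': title,
        'action': 'submit',
        'limit': limit,    # up to 1000 revisions per request
        'offset': offset,  # timestamp to resume from, per the Export interface
    }
    request = urllib2.Request(index_url, urllib.urlencode(params))
    return urllib2.urlopen(request).read()  # one ~1000-revision XML chunk
```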
Original comment by nemow...@gmail.com on 13 Jul 2011 at 12:39
As a workaround, I edited the titles list and moved the problematic titles to the end, to postpone the download of those histories and watch them more carefully (see the sketch below). In one case, I got the MemoryError even though Python had reached less than 1 GiB of RAM and almost 2 additional GiB were still available; in another case, the page history is 1.7 GiB when downloaded with Special:Export in a browser, and I don't know how much of it the script had managed to download. (Looking around a bit, it seems it can be normal to get a MemoryError at about 1 GiB of memory, whatever the amount of free memory.)
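The reordering itself is trivial; something like this does it (titles.txt and the title shown are placeholders for the actual titles file and the problematic pages):

```python
# Hypothetical helper for the workaround: push known-problematic titles to
# the end of the list so the rest of the dump completes first.
problematic = ['ArenaNet:Guild Wars 2 suggestions/Scratchpad']  # example title

with open('titles.txt') as f:
    titles = [line.strip() for line in f if line.strip()]

reordered = ([t for t in titles if t not in problematic] +
             [t for t in titles if t in problematic])

with open('titles.txt', 'w') as f:
    f.write('\n'.join(reordered) + '\n')
```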
They're all like this: http://p.defau.lt/?3JuOkvmlwDGqi_1A30V6qQ
Original comment by nemow...@gmail.com on 17 Jul 2011 at 7:51
Again on urbandead:
    Traceback (most recent call last):
      File "dumpgenerator.py", line 1205, in <module>
        main()
      File "dumpgenerator.py", line 1196, in main
        resumePreviousDump(config=config, other=other)
      File "dumpgenerator.py", line 1056, in resumePreviousDump
        generateXMLDump(config=config, titles=titles, start=lastxmltitle)
      File "dumpgenerator.py", line 457, in generateXMLDump
        xml = getXMLPage(config=config, title=title)
      File "dumpgenerator.py", line 389, in getXMLPage
        xml = xml.split('</page>')[0] + ' <revision>' + ('<revision>'.join(xml2.split('<revision>')[1:]))
    MemoryError
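The line that fails splices every new 1000-revision chunk into one ever-growing string, so the page's whole history (plus a temporary copy during the concatenation) has to fit in memory at once. A minimal sketch of the streaming alternative, appending each chunk's revisions straight to the dump file; write_page_incrementally and fetch_chunk are hypothetical names, and error handling is omitted:

```python
# Hypothetical sketch: write the <page> header once, then append each
# chunk's <revision> elements to the output file as they arrive, so peak
# memory is bounded by one ~1000-revision chunk rather than the full history.
def write_page_incrementally(out, fetch_chunk):
    # fetch_chunk(offset) is assumed to return one Special:Export response
    # (a string) with up to 1000 revisions, or '' when the history is done.
    offset = 0
    chunk = fetch_chunk(offset)
    out.write(chunk.split('<revision>')[0])       # page header, written once
    while chunk:
        body = chunk.split('</page>')[0]          # drop the closing tags
        for rev in body.split('<revision>')[1:]:  # revisions in this chunk
            out.write('<revision>' + rev)         # append, never re-concatenate
        offset += 1000
        chunk = fetch_chunk(offset)
    out.write('</page>\n')                        # close the element at the end
```

This is essentially the same splitting logic as the failing line, just written to disk as it goes instead of accumulated in a string.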
Original comment by nemow...@gmail.com on 10 Nov 2013 at 9:23
Other examples are http://dota2.gamepedia.com/ (Template:Dictionary/defindex, 13k revisions) and http://wowpedia.org/ (Patch mirrors, 7k revisions).
Original comment by nemow...@gmail.com on 16 Feb 2014 at 2:01
Original issue reported on code.google.com by emi...@gmail.com on 16 Apr 2011 at 9:35