Mattschillinger / wikiteam

Automatically exported from code.google.com/p/wikiteam

Download history of multiple pages at once #18

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I don't understand the code much so excuse me if I say silly things.
The script currently seems to download the history of one page at a time. To 
reduce the number of requests and make things faster, it would probably be 
better if it could ask the API for the history of multiple titles at once, 
after checking that this won't exceed the revision limit.

Original issue reported on code.google.com by nemow...@gmail.com on 9 Jul 2011 at 3:50
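
For illustration, the batching idea above could look roughly like the sketch below. It is not wikiteam code: `batch_titles`, the per-page revision counts, and both limits (50 titles, 1000 revisions) are assumptions made for the example, and it presumes the revision count of each page is already known before the batch is requested.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_titles(
    pages: Iterable[Tuple[str, int]],
    max_titles: int = 50,       # hypothetical cap on titles per request
    max_revisions: int = 1000,  # hypothetical cap on revisions per request
) -> Iterator[List[str]]:
    """Greedily group (title, revision_count) pairs into batches.

    A batch is closed as soon as adding the next page would exceed either
    limit. A page whose own history is larger than max_revisions ends up
    alone in its batch, so the caller can handle it separately.
    """
    batch: List[str] = []
    revisions = 0
    for title, count in pages:
        too_many_titles = len(batch) >= max_titles
        too_many_revisions = revisions + count > max_revisions
        if batch and (too_many_titles or too_many_revisions):
            yield batch
            batch, revisions = [], 0
        batch.append(title)
        revisions += count
    if batch:
        yield batch
```

Each yielded list would then be sent as one request instead of one request per title; oversized pages still get a request of their own, so the single-page path has to stay in place.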

GoogleCodeExporter commented 8 years ago
Yep. I have thought about this before, as a way to speed up the script and 
decrease server requests (but it will need more kb/s, and it may not be nice to 
the server).

But there is a chance of errors with the Special:Export revision limit. So 
this is a dangerous option that I would like to test thoroughly before releasing.

Original comment by emi...@gmail.com on 9 Jul 2011 at 7:01

GoogleCodeExporter commented 8 years ago
I'm available for testing. :-)
This is probably best used with the API only. In that case, I think it could even 
decrease server load because it reduces requests: sometimes people come to 
#wikimedia-tech to ask how to crawl WMF sites without problems, and IIRC they're 
even advised to download pages in batches of 50.
A bandwidth throttle may be useful, but I don't know if it's possible to control 
this aspect.

Original comment by nemow...@gmail.com on 9 Jul 2011 at 7:29
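
On the bandwidth throttle mentioned above: one way it could be done on the client side is to count the bytes received and sleep whenever the download gets ahead of a byte budget. The sketch below only illustrates that idea; the `Throttle` class and its parameters are hypothetical and not part of the script.

```python
import time


class Throttle:
    """Keep the average transfer rate at or below max_bytes_per_sec."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(max_bytes_per_sec)
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._start = clock()  # time at which the budget starts counting
        self._sent = 0         # total bytes recorded so far

    def consume(self, nbytes):
        """Record nbytes as transferred and sleep if we are ahead of the budget.

        Returns the number of seconds slept (0.0 if no sleep was needed).
        """
        self._sent += nbytes
        # Earliest moment at which this many bytes are allowed to have arrived.
        earliest = self._start + self._sent / self.rate
        delay = earliest - self._clock()
        if delay > 0:
            self._sleep(delay)
            return delay
        return 0.0
```

The caller would invoke `consume(len(chunk))` after reading each chunk of a response. This limits the average rate only; it does not smooth out short bursts.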

GoogleCodeExporter commented 8 years ago
This would also require Issue 8 (downloads are buffered completely to memory 
before writing to disk) to be resolved, as this would definitely result in 
larger downloads than before.

Original comment by griffin....@gmail.com on 9 Jul 2011 at 8:04
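
To make the Issue 8 concern concrete: instead of holding a whole response in memory, the body could be copied to disk in fixed-size chunks. This is a minimal sketch under the assumption that the response is available as a file-like object with a `read(size)` method; `stream_to_file` and the 64 KiB chunk size are made up for the example.

```python
from typing import BinaryIO


def stream_to_file(response: BinaryIO, path: str, chunk_size: int = 64 * 1024) -> int:
    """Copy a file-like response body to path chunk by chunk.

    Only one chunk is held in memory at a time. Returns the number of
    bytes written.
    """
    written = 0
    with open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

With this approach memory use depends on `chunk_size` rather than on how many revisions a batched request returns.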

GoogleCodeExporter commented 8 years ago
I'm not sure; after all, memory consumption is almost negligible 
right now.
If you put some limit on the number of revisions you shouldn't have any problem; 
and again, I'm ready to test memory consumption and do crash tests as well. :-p

Original comment by nemow...@gmail.com on 9 Jul 2011 at 9:47

GoogleCodeExporter commented 8 years ago

Original comment by nemow...@gmail.com on 29 Feb 2012 at 11:27