Mattschillinger / wikiteam

Automatically exported from code.google.com/p/wikiteam
commonschecker.py gets horribly slow with many files #64

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I'm checking the ZIPs of Commons files for February 2011, and after more than 
1000 CPU minutes the run still hasn't finished. If I'm not mistaken, 1172081 
files were uploaded that month:
$ grep -Ec "[|]201102[0-9]{8}" /data/project/commonsgrab2/2011/commonssql.csv 
1172081

I think the problem is that each line of the CSV file is checked against each 
line of the list of files in the ZIP (line 87), isn't it?
https://code.google.com/p/wikiteam/source/browse/trunk/commonschecker.py?spec=svn836&r=360#87
Isn't there a way to sort them and check in order?
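
To illustrate the suspicion, here is a minimal sketch and not the actual commonschecker.py code. All function names and the sample filenames are hypothetical, and it assumes the check boils down to "which names from the CSV are absent from the ZIP listing". The first function is the every-line-against-every-line pattern, the second replaces the inner scan with a set lookup, and the third sorts both lists and walks them in order:

```python
def missing_nested(expected, archived):
    """Quadratic baseline: rescan the whole ZIP listing for every CSV name."""
    missing = []
    for name in expected:
        found = False
        for member in archived:
            if member == name:
                found = True
                break
        if not found:
            missing.append(name)
    return missing


def missing_with_set(expected, archived):
    """Build a set once, then each membership test is a hash lookup."""
    archived_set = set(archived)
    return [name for name in expected if name not in archived_set]


def missing_sorted(expected, archived):
    """Sort both lists once, then advance through them in step."""
    exp = sorted(expected)
    arc = sorted(archived)
    missing = []
    i = 0
    for name in exp:
        # Skip ZIP entries that sort before the current CSV name.
        while i < len(arc) and arc[i] < name:
            i += 1
        if i == len(arc) or arc[i] != name:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real dump.
    csv_names = ["b.jpg", "a.png", "c.svg"]
    zip_names = ["c.svg", "a.png", "d.gif"]
    print(missing_nested(csv_names, zip_names))
    print(missing_with_set(csv_names, zip_names))
    print(missing_sorted(csv_names, zip_names))
```

With over a million names per month the nested version does on the order of n*m comparisons, while the other two need roughly one pass after the set is built or the lists are sorted.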

Original issue reported on code.google.com by nemow...@gmail.com on 9 Sep 2013 at 12:52

GoogleCodeExporter commented 8 years ago
It's still running, after 1600 CPU minutes.

I submitted r837 by Betacommand and the whole month (minus 3 ZIPs which I had 
to delete) now took:
real    1m16.499s
user    1m8.324s
sys     0m1.408s

A speedup of at least 1300-1400 times is enough for me to consider this 
fixed. :)
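
A rough check of that figure, as a sketch under two assumptions: the old run is counted at the 1600 CPU minutes reported above (it had not finished, so this is a lower bound), and the new run is counted as user plus sys time. The variable names are only for illustration:

```python
# Old run: still going after 1600 CPU minutes, so this is a lower bound.
old_cpu_seconds = 1600 * 60

# New run: user + sys from the timing output (1m8.324s and 0m1.408s).
new_cpu_seconds = (1 * 60 + 8.324) + 1.408

speedup = old_cpu_seconds / new_cpu_seconds
print(f"speedup of at least about {speedup:.0f}x")
```

The comparison is not exact, since the new run covered 3 fewer ZIPs.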

Original comment by nemow...@gmail.com on 9 Sep 2013 at 5:43