What steps will reproduce the problem?
1. Use WARC Tools r245 (warcvalidator.c: r211 | voidptrptr | 2008-11-07)
2. time warcvalidator -f 100MB.warc.gz
What is the expected output? What do you see instead?
Expected a run time comparable to zcat (around 30 sec), but it took over 15 min instead.
Please use labels and text to provide additional information.
from warc-tools list post [Sep 14 2009, 8:46 am]:
hi WARC Tools,
warcvalidator seems to take an inordinate amount of time
to validate our warc files, which ultimately do turn out
to be valid.
WARC Tools r245
warcvalidator.c: r211 | voidptrptr | 2008-11-07
on an unloaded crawler with dual 2.6GHz cpus and 4GB memory,
it took about 5 hours to process a 1GB WARC (the new standard
size) and 15 minutes to validate a 100MB WARC. gzip takes about
20 seconds to unpack a 1GB WARC.
OS: Ubuntu 5.10 "Breezy Badger"
kernel: Linux 2.6.16.1 #1 SMP May11 2006 x86_64 GNU/Linux
cpu: 2 x AMD 64GB 2605.873MHz
mem: 4015252k total
during processing, the cpus were mostly idle, about 90% of
memory was in use, and disk activity was low. in top, with a
sample delay of 0.1 seconds, i could see that warcvalidator
appears briefly and then goes away for a few seconds, then
appears briefly again.
time zcat WARC_1GB > /dev/null
real 0m20.781s
user 0m19.620s
sys 0m1.150s
time warcvalidator -f WARC_100MB
real 15m10.865s
user 0m8.510s
sys 0m8.770s
time warcvalidator -f WARC_1GB
real 298m58.218s (4.98 hrs)
user 1m21.760s
sys 1m35.540s
any idea what warcvalidator is doing?
http://groups.google.com/group/warc-tools/browse_thread/thread/583fc0f407deaec7/d24adbf3e57995b2#d24adbf3e57995b2
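The timings quoted above suggest that warcvalidator spends most of its wall-clock time waiting rather than computing. As a rough sketch, the share of wall-clock time actually spent on the CPU can be worked out from the `time` output. The helper names `to_seconds` and `cpu_fraction` are made up for this illustration and are not part of WARC Tools:

```python
import re

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '15m10.865s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def cpu_fraction(real, user, system):
    """Fraction of wall-clock time spent on the CPU (user + sys over real)."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)

# (real, user, sys) figures copied from the report above.
runs = {
    "zcat WARC_1GB": ("0m20.781s", "0m19.620s", "0m1.150s"),
    "warcvalidator WARC_100MB": ("15m10.865s", "0m8.510s", "0m8.770s"),
    "warcvalidator WARC_1GB": ("298m58.218s", "1m21.760s", "1m35.540s"),
}

for name, (real, user, system) in runs.items():
    share = cpu_fraction(real, user, system)
    print(f"{name}: {share:.1%} of wall-clock time on CPU")

# How much longer the 1GB run took than the 100MB run (wall-clock).
slowdown = to_seconds("298m58.218s") / to_seconds("15m10.865s")
print(f"1GB vs 100MB wall-clock ratio: {slowdown:.1f}x")
```

By these figures zcat is almost entirely CPU-bound, while warcvalidator is on the CPU for only about 1 to 2 percent of its run time, and a tenfold increase in file size costs roughly twenty times the wall-clock time. That is consistent with the observation that the process appears only briefly in top, but it does not by itself show what the process is waiting on.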
Original issue reported on code.google.com by st...@archive.org on 1 Apr 2010 at 9:53