Open GoogleCodeExporter opened 9 years ago
(another comment that I will spin off to another defect)
There's several places I saw last night where a Bloom filter would help
performance. If you can add Bloom filter code to skipfish.c somewhere,
I'll try to use your code and measure send patches as appropriate.
There's a reasonable Bloom filter implementation @rpm5.org in rpmio/rpmbf.c
that can be swiped and adjusted to taste if you want.
(another aside, relevant to hash <-> bloom filters)
FYI: Your pivot hash function could be improved a bit. lookup3.c is what I
typically use because a highly tuned hash pair generator is implemented in
lookup.c
and I'm lazy. YMMV, and hash functions are another religious/arbitrary issue, My
goal is higher performing skipfish.c, not anything else.
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 1:52
This teensy bit of inlining kills off piggie #5:
Index: http_client.c
===================================================================
RCS file: /v/rpm/cvs/skipfish/http_client.c,v
retrieving revision 1.1.1.1
diff -p -u -w -r1.1.1.1 http_client.c
--- http_client.c 4 Jul 2010 19:53:33 -0000 1.1.1.1
+++ http_client.c 5 Jul 2010 19:25:58 -0000
@@ -1144,7 +1144,12 @@ void fprint_response(struct http_respons
for (i=0;i<res->pay_len;i++)
- if (res->payload[i] <= 0x20 || strchr("<>'\"", res->payload[i])) {
+ if (res->payload[i] <= 0x20
+ || res->payload[i] == '<'
+ || res->payload[i] == '>'
+ || res->payload[i] == '\''
+ || res->payload[i] == '"')
+ {
if (!in_space) {
in_space = 1;
if (c_len <= FP_MAX_LEN)
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 7:28
Ah, tricked by my toy test using -r 1000. So the piggy list above is
heavily biased towards setup/teardown issues (and is also biased by
the dkipjack.wl I chose).
Recalibrating to -r 10000 I see these piggies:
4,872,672,720 PROGRAM TOTALS
1,295,474,582 * http_client.c:next_from_queue [/X/src/wdj/skipfish/skipfish]
396,383,776 * ???:_int_malloc [/lib/libc-2.11.1.so]
359,937,711 * ???:_int_free [/lib/libc-2.11.1.so]
257,351,145 * ???:__GI_memset [/lib/libc-2.11.1.so]
239,443,108 * ???:strcasecmp [/lib/libc-2.11.1.so]
...
which is more promising because the main event loop is where it should be.
Since the main event loop looks like
for (i=0;i<conn_cur;i++) {
...
}
there's an opportunity to use -fopenmp and a #pragma and
split the I/O across multiple cores. No friggin' clue yet but
looks feasible.
And here's a dinky change to avoid re-allocing the polls on every
call in a moment. conn_cur is capped by max_connections, so the free-and-realloc
of the pollfd's can be done once, and a memset for the current connections
zero'ed
instead
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 8:49
If you wish to work on this, I will be happy to accept performance-improving
patches.
My impression is that in reality, the bottleneck is network I/O, and skipfish
seldom has an appreciable CPU usage; this is why I haven't put too much effort
into optimizing the internal data structures, data parsing, or memory
allocation.
Original comment by lcam...@gmail.com
on 5 Jul 2010 at 9:02
Here's the patch to avoid an alloc/free within your main polling loop:
Index: http_client.c
===================================================================
RCS file: /v/rpm/cvs/skipfish/http_client.c,v
retrieving revision 1.1.1.1
diff -p -u -w -r1.1.1.1 http_client.c
--- http_client.c 4 Jul 2010 19:53:33 -0000 1.1.1.1
+++ http_client.c 5 Jul 2010 21:14:48 -0000
@@ -1922,19 +1927,21 @@ connect_error:
routine for the scanning loop. */
u32 next_from_queue(void) {
+ static struct pollfd* p;
u32 cur_time = time(0);
+ if (p == NULL)
+ p = __DFL_ck_alloc(sizeof(struct pollfd) * max_connections);
+
if (conn_cur) {
- static struct pollfd* p;
struct conn_entry* c = conn;
u32 i = 0;
/* First, go through all connections, handle connects, SSL handshakes, data
reads and writes, and exceptions. */
- if (p) free(p);
- p = __DFL_ck_alloc(sizeof(struct pollfd) * conn_cur);
+ memset(p, 0, sizeof(struct pollfd) * conn_cur);
while (c) {
p[i].fd = c->fd;
I'll get some measurements eventually (but as you can tell I don't
yet have any reasonable comparison point established).
You are likely correct the network I/O is the bottleneck to skipfish per se.
But I'm hoping to steal your pivot store for RPM package management,
and am also (I maintain POPT library) gonna use skipfish to automate
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 9:18
(interrupted, continuing)
--OPTION KEY=VALUE,KEY=VALUE,...
processing (skipfish has several "mappings" of this form).
I also am questing spidering code in C so that I can index *.rpm
packages into a Cassandra distributed store. That's an entirely
different usage case than, say, installing 404's into web logs ;-)
But I can/will likely send along performance patches as I get up to speed
on your skipfish code. Just expect a few brain farts from me, feel free to
ignore.
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 9:22
I incorporated that strchr and poll() tweaks.
Original comment by lcam...@gmail.com
on 5 Jul 2010 at 10:12
Cool. Meanwhile here is a _HIGHLY_ speculative patch (all I can tell is that it
doesn't
segfault) to parallelize your next_from_queue() polling loop using -fopenmp.
I've parallelized a couple
loops, but I am by no means an expert at using -fopenmp.
But if you can parallelize your service model, you should be able to crank
even more performance from skipfish (albeit mostly by reducing inter-request
service latency which may not be too large a win).
@@ -1947,10 +1955,20 @@ u32 next_from_queue(void) {
poll(p, conn_cur, 100);
+ cptrs = ck_alloc(conn_cur * sizeof(c));
c = conn;
+ for (i=0;i<conn_cur;i++) {
+ cptrs[i] = c;
+ c = c->next;
+ }
+#if defined(_OPENMP)
+#pragma omp parallel for \
+
shared(cptrs,req_retried,req_errors_net,req_errors_cur,bytes_sent,conn_idle_tmou
t,tear_down_idle) private(c,i)
+#endif
for (i=0;i<conn_cur;i++) {
+ c = cptrs[i];
struct conn_entry* next = c->next;
/* Connection closed: see if we have any pending data to write. If yes,
@@ -2164,9 +2182,8 @@ network_error:
}
- c = next;
-
}
+ free(cptrs);
}
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 10:32
And I can't claim anything because no -fopenmp. News shortly ...
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 10:42
Now confirmed "Compiled with -fopenmp" and "skipfish doesn't segfault". Here's
the cchegrind (I keep saying callgrind) verifier that indeed, next_from_queue()
is parallelized:
1,203,893,480 * http_client.c:next_from_queue [/X/src/wdj/skipfish/skipfish]
140,864 > ???:poll (3407x) [/lib/libc-2.11.1.so]
2,426,977 > ???:GOMP_parallel_end (3407x) [/usr/lib/libgomp.so.1.0.0]
4,205,159 > ???:GOMP_parallel_start (3407x) [/usr/lib/libgomp.so.1.0.0]
546 > /X/src/wdj/skipfish/alloc-inl.h:__DFL_ck_alloc (1x)
121,703,883 > http_client.c:conn_associate (10007x) [/X/src/wdj/skipfish/skipfish]
51,120 > ???:time (3408x) [/lib/libc-2.11.1.so]
459,809 > ???:free (3407x) [/lib/libc-2.11.1.so]
6,906 > ???:_dl_runtime_resolve (4x) [/lib/ld-2.11.1.so]
3,914,728,145 > http_client.c:next_from_queue.omp_fn.0 (3407x)
[/X/src/wdj/skipfish/skipfish]
136,109 > ???:__memset_sse2 (3407x) [/lib/libc-2.11.1.so]
Any suggestions for better loops to parallelize or better invocation test cases?
/usr/bin/time with/without --enable-openmp is essentially identical.
crap ... forget to turn on multiple cpu's in VMFuxion ... hang on ...
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 11:12
(please take all these measurements as _HIGHLY_ speculative and -r 10000
limited)
OK, on a 4-way VM I'm seeing w /usr/bin/time
296.43user 16.15system 1:42.24elapsed 305%CPU (0avgtext+0avgdata 13648maxresident)k
960inputs+0outputs (4major+1174minor)pagefaults 0swaps
Same code on a 1-way VM
1.03user 5.01system 1:29.90elapsed 6%CPU (0avgtext+0avgdata 10736maxresident)k
0inputs+5112outputs (0major+1432minor)pagefaults 0swaps
And compiled with --disable-openmp
0.92user 4.76system 1:33.53elapsed 6%CPU (0avgtext+0avgdata 10192maxresident)k
0inputs+5112outputs (0major+1391minor)pagefaults 0swaps
I'm also seeing HTTP requests kB/s ~2x larger (but this is _NOT_ a careful
measurement at all,
I reserve the right to deny everything ;-).
So it looks _FEASIBLE_ to use -fopenmp to attempt multi-threaded I/O more
intelligently
down the road a bit.
Original comment by n3npq....@gmail.com
on 5 Jul 2010 at 11:28
Ah there's the segfaults (as I expected). Most of the issues (from looking at
3-4 of the
segfaults) have to do with thread local storage being realloc'd within thread
and
than accessed/free'd out of the loop. That's _FEASIBLY_ fixable in my book, but
"Caveat emptor"
applies.
Original comment by n3npq....@gmail.com
on 6 Jul 2010 at 1:50
i downloaded and installed cygwin for win7 and tried to makefile but not able
to do anything ...keep getting message command not found.
Original comment by r13chau...@gmail.com
on 21 Sep 2012 at 1:16
Original issue reported on code.google.com by
n3npq....@gmail.com
on 5 Jul 2010 at 1:42