callgrind measured hotspot elimination

GoogleCodeExporter commented 9 years ago

Callgrind driven tuning will take a bit to sort out, so this isn't
yet a defect or a RFE or anything else, just documenting what I see.

For the same toy test case I've been using, here are the top 10 piggies
summarized (snipped from callgrind output):

2,967,898,164  PROGRAM TOTALS

536,699,471  *  ???:strcasecmp [/lib/libc-2.11.1.so]
289,093,263  *  ???:__GI_strstr [/lib/libc-2.11.1.so]
187,376,970  *  ???:memchr [/lib/libc-2.11.1.so]
170,039,437  *  ???:_int_malloc [/lib/libc-2.11.1.so]
157,438,620  *  ???:index [/lib/ld-2.11.1.so]
148,169,230  *  ???:_int_free [/lib/libc-2.11.1.so]
138,960,360  *  database.c:is_c_sens [/X/src/wdj/skipfish/skipfish]
135,768,606  *  crawler.c:crawl_dir_dict_init [/X/src/wdj/skipfish/skipfish]
109,660,680  *  ???:__GI_memset [/lib/libc-2.11.1.so]
 83,389,324  *  ???:free [/lib/libc-2.11.1.so]

So you are quite right to attempt malloc speedups (even if
malloc_usable_size is gonna be tricky imho, particularly
because of MALLOC_CHECK_ issue described at
    http://sourceware.org/bugzilla/show_bug.cgi?id=1349
BTW, I instantly saw MALLOC_CHECK_ issues when using callgrind.

However there are some alternatives to using malloc_usable_size()
to speed-up skipfish.

E.g. destroy_req() shows up prominently in the call chains.
The struct is constant sized and so malloc/free overhead
can be avoided by chaining already alloc'd items for reuse.

Another speed-up would be to fiddle up better than linear
search (is_c_sens and strcasecmp are bboth in the top 10 piggies list)
for loops like (crawler.c:2624)

      for (c=0;c<pv->child_cnt;c++)
        if (!((is_c_sens(pv) ? strcmp : strcasecmp)((char*)tmp,
            (char*)pv->child[c]->name))) break;

Another area for improvement is the use of strstr() in analysis.c,
although the parsing there is hard to do no matter what.

Please note that there's nothing particularly wrong or slow with
skipfish.c as is. I'm just calibrating my expectations before attempting
better using callgrind.

Original issue reported on code.google.com by n3npq....@gmail.com on 5 Jul 2010 at 1:42

GoogleCodeExporter commented 9 years ago

(another comment that I will spin off to another defect)

There's several places I saw last night where a Bloom filter would help
performance. If you can add Bloom filter code to skipfish.c somewhere,
I'll try to use your code and measure send patches as appropriate.

There's a reasonable Bloom filter implementation @rpm5.org in rpmio/rpmbf.c
that can be swiped and adjusted to taste if you want.

(another aside, relevant to hash <-> bloom filters)
FYI: Your pivot hash function could be improved a bit. lookup3.c is what I
typically use because a highly tuned hash pair generator is implemented in 
lookup.c
and I'm lazy. YMMV, and hash functions are another religious/arbitrary issue, My
goal is higher performing skipfish.c, not anything else.

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 1:52

GoogleCodeExporter commented 9 years ago

This teensy bit of inlining kills off piggie #5:

Index: http_client.c
===================================================================
RCS file: /v/rpm/cvs/skipfish/http_client.c,v
retrieving revision 1.1.1.1
diff -p -u -w -r1.1.1.1 http_client.c
--- http_client.c   4 Jul 2010 19:53:33 -0000   1.1.1.1
+++ http_client.c   5 Jul 2010 19:25:58 -0000
@@ -1144,7 +1144,12 @@ void fprint_response(struct http_respons

   for (i=0;i<res->pay_len;i++)

-    if (res->payload[i] <= 0x20 || strchr("<>'\"", res->payload[i])) {
+    if (res->payload[i] <= 0x20
+     || res->payload[i] == '<'
+     || res->payload[i] == '>'
+     || res->payload[i] == '\''
+     || res->payload[i] == '"')
+    {
       if (!in_space) {
         in_space = 1;
         if (c_len <= FP_MAX_LEN)

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 7:28

GoogleCodeExporter commented 9 years ago

Ah, tricked by my toy test using -r 1000. So the piggy list above is
heavily biased towards setup/teardown issues (and is also biased by
the dkipjack.wl I chose).

Recalibrating to -r 10000 I see these piggies:

4,872,672,720  PROGRAM TOTALS
1,295,474,582  *  http_client.c:next_from_queue [/X/src/wdj/skipfish/skipfish]
   396,383,776  *  ???:_int_malloc [/lib/libc-2.11.1.so]
   359,937,711  *  ???:_int_free [/lib/libc-2.11.1.so]
   257,351,145  *  ???:__GI_memset [/lib/libc-2.11.1.so]
   239,443,108  *  ???:strcasecmp [/lib/libc-2.11.1.so]
   ...
which is more promising because the main event loop is where it should be.

Since the main event loop looks like
    for (i=0;i<conn_cur;i++) {
        ...
    }
there's an opportunity to use -fopenmp and a #pragma and
split the I/O across multiple cores. No friggin' clue yet but
looks feasible.

And here's a dinky change to avoid re-allocing the polls on every
call in a moment. conn_cur is capped by max_connections, so the free-and-realloc
of the pollfd's can be done once, and a memset for the current connections 
zero'ed
instead

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 8:49

GoogleCodeExporter commented 9 years ago

If you wish to work on this, I will be happy to accept performance-improving 
patches.

My impression is that in reality, the bottleneck is network I/O, and skipfish 
seldom has an appreciable CPU usage; this is why I haven't put too much effort 
into optimizing the internal data structures, data parsing, or memory 
allocation.

Original comment by lcam...@gmail.com on 5 Jul 2010 at 9:02

Added labels: Priority-Low, Type-Enhancement
Removed labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

Here's the patch to avoid an alloc/free within your main polling loop:

Index: http_client.c
===================================================================
RCS file: /v/rpm/cvs/skipfish/http_client.c,v
retrieving revision 1.1.1.1
diff -p -u -w -r1.1.1.1 http_client.c
--- http_client.c   4 Jul 2010 19:53:33 -0000   1.1.1.1
+++ http_client.c   5 Jul 2010 21:14:48 -0000

@@ -1922,19 +1927,21 @@ connect_error:
    routine for the scanning loop. */

 u32 next_from_queue(void) {
+  static struct pollfd* p;

   u32 cur_time = time(0);

+   if (p == NULL)
+      p = __DFL_ck_alloc(sizeof(struct pollfd) * max_connections);
+
   if (conn_cur) {
-    static struct pollfd* p;
     struct conn_entry* c = conn;
     u32 i = 0;

     /* First, go through all connections, handle connects, SSL handshakes, data
        reads and writes, and exceptions. */

-    if (p) free(p);
-    p = __DFL_ck_alloc(sizeof(struct pollfd) * conn_cur);
+    memset(p, 0, sizeof(struct pollfd) * conn_cur);

     while (c) {
       p[i].fd = c->fd;

I'll get some measurements eventually (but as you can tell I don't
yet have any reasonable comparison point established).

You are likely correct the network I/O is the bottleneck to skipfish per se.
But I'm hoping to steal your pivot store for RPM package management,
and am also (I maintain POPT library) gonna use skipfish to automate

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 9:18

GoogleCodeExporter commented 9 years ago

(interrupted, continuing)
    --OPTION KEY=VALUE,KEY=VALUE,...
processing (skipfish has several "mappings" of this form).

I also am questing spidering code in C so that I can index *.rpm
packages into a Cassandra distributed store. That's an entirely
different usage case than, say, installing 404's into web logs ;-)

But I can/will likely send along performance patches as I get up to speed
on your skipfish code. Just expect a few brain farts from me, feel free to 
ignore.

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 9:22

GoogleCodeExporter commented 9 years ago

I incorporated that strchr and poll() tweaks.

Original comment by lcam...@gmail.com on 5 Jul 2010 at 10:12

GoogleCodeExporter commented 9 years ago

Cool. Meanwhile here is a _HIGHLY_ speculative patch (all I can tell is that it 
doesn't
segfault) to parallelize your next_from_queue() polling loop using -fopenmp. 
I've parallelized a couple
loops, but I am by no means an expert at using -fopenmp.

But if you can parallelize your service model, you should be able to crank
even more performance from skipfish (albeit mostly by reducing inter-request
service latency which may not be too large a win).

@@ -1947,10 +1955,20 @@ u32 next_from_queue(void) {

     poll(p, conn_cur, 100);

+    cptrs = ck_alloc(conn_cur * sizeof(c));
     c = conn;
+    for (i=0;i<conn_cur;i++) {
+   cptrs[i] = c;
+   c = c->next;
+    }

+#if defined(_OPENMP)
+#pragma omp parallel for \
+    
shared(cptrs,req_retried,req_errors_net,req_errors_cur,bytes_sent,conn_idle_tmou
t,tear_down_idle) private(c,i)
+#endif
     for (i=0;i<conn_cur;i++) {

+      c = cptrs[i];
       struct conn_entry* next = c->next;

       /* Connection closed: see if we have any pending data to write. If yes,
@@ -2164,9 +2182,8 @@ network_error:

       }

-      c = next;
-
     }
+    free(cptrs);

   }

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 10:32

GoogleCodeExporter commented 9 years ago

And I can't claim anything because no -fopenmp. News shortly ...

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 10:42

GoogleCodeExporter commented 9 years ago

Now confirmed "Compiled with -fopenmp" and "skipfish doesn't segfault". Here's
the cchegrind (I keep saying callgrind) verifier that indeed, next_from_queue()
is parallelized:

1,203,893,480  *  http_client.c:next_from_queue [/X/src/wdj/skipfish/skipfish]
      140,864  >   ???:poll (3407x) [/lib/libc-2.11.1.so]
    2,426,977  >   ???:GOMP_parallel_end (3407x) [/usr/lib/libgomp.so.1.0.0]
    4,205,159  >   ???:GOMP_parallel_start (3407x) [/usr/lib/libgomp.so.1.0.0]
          546  >   /X/src/wdj/skipfish/alloc-inl.h:__DFL_ck_alloc (1x)
  121,703,883  >   http_client.c:conn_associate (10007x) [/X/src/wdj/skipfish/skipfish]
       51,120  >   ???:time (3408x) [/lib/libc-2.11.1.so]
      459,809  >   ???:free (3407x) [/lib/libc-2.11.1.so]
        6,906  >   ???:_dl_runtime_resolve (4x) [/lib/ld-2.11.1.so]
3,914,728,145  >   http_client.c:next_from_queue.omp_fn.0 (3407x) 
[/X/src/wdj/skipfish/skipfish]
      136,109  >   ???:__memset_sse2 (3407x) [/lib/libc-2.11.1.so]

Any suggestions for better loops to parallelize or better invocation test cases?

/usr/bin/time with/without --enable-openmp is essentially identical.

crap ... forget to turn on multiple cpu's in VMFuxion ... hang on ...

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 11:12

GoogleCodeExporter commented 9 years ago

(please take all these measurements as _HIGHLY_ speculative and -r 10000 
limited)

OK, on a 4-way VM I'm seeing w /usr/bin/time

    296.43user 16.15system 1:42.24elapsed 305%CPU (0avgtext+0avgdata 13648maxresident)k
    960inputs+0outputs (4major+1174minor)pagefaults 0swaps

Same code on a 1-way VM

    1.03user 5.01system 1:29.90elapsed 6%CPU (0avgtext+0avgdata 10736maxresident)k
    0inputs+5112outputs (0major+1432minor)pagefaults 0swaps

And compiled with --disable-openmp

    0.92user 4.76system 1:33.53elapsed 6%CPU (0avgtext+0avgdata 10192maxresident)k
    0inputs+5112outputs (0major+1391minor)pagefaults 0swaps

I'm also seeing HTTP requests kB/s ~2x larger (but this is _NOT_ a careful 
measurement at all,
I reserve the right to deny everything ;-).

So it looks _FEASIBLE_ to use -fopenmp to attempt multi-threaded I/O more 
intelligently
down the road a bit.

Original comment by n3npq....@gmail.com on 5 Jul 2010 at 11:28

GoogleCodeExporter commented 9 years ago

Ah there's the segfaults (as I expected). Most of the issues (from looking at 
3-4 of the
segfaults) have to do with thread local storage being realloc'd within thread 
and
than accessed/free'd out of the loop. That's _FEASIBLY_ fixable in my book, but 
"Caveat emptor"
applies.

Original comment by n3npq....@gmail.com on 6 Jul 2010 at 1:50

GoogleCodeExporter commented 9 years ago

i downloaded and installed cygwin for win7 and tried to makefile but not able 
to do anything ...keep getting message command not found.

Original comment by r13chau...@gmail.com on 21 Sep 2012 at 1:16

arifnyet / skipfish

callgrind measured hotspot elimination #77