criblio / appscope

Gain observability into any Linux command or application with no code modification
https://appscope.dev
Apache License 2.0

Performance #968

Closed jrcheli closed 1 year ago

jrcheli commented 2 years ago

[screenshot: perf flame graph of a scoped nginx benchmark, with three areas circled]

jrcheli commented 2 years ago

I've circled three areas. From left to right: 1) pcre2 in the datapath, 2) memcpy in the datapath, 3) lockless list operations on our periodic thread.

These are good candidates for further investigation. Thanks to @michalbiesek for creating this awesome flame-graph visualization using perf!

jrcheli commented 2 years ago

@michalbiesek has provided the raw data (better than my screen shot above) here: example_nginx_bench

As we look at pcre2 stuff, this may help: https://www.pcre.org/current/doc/html/pcre2perform.html
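Independent of that doc, two general pcre2 levers are compiling patterns once up front and enabling the JIT where it is available. Purely as an illustration of the pcre2 API (this is not libscope code, and the pattern here is made up), a minimal sketch:

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int errcode;
        PCRE2_SIZE erroffset;

        /* Compile the pattern once, up front, rather than per match. */
        pcre2_code *re = pcre2_compile((PCRE2_SPTR)"HTTP/1\\.[01]",
                                       PCRE2_ZERO_TERMINATED, 0,
                                       &errcode, &erroffset, NULL);
        if (re == NULL) return 1;

        /* JIT-compile if supported; pcre2_match() then uses the JIT code. */
        pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);

        pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
        const char *subject = "GET / HTTP/1.1\r\n";
        int rc = pcre2_match(re, (PCRE2_SPTR)subject, strlen(subject),
                             0, 0, md, NULL);
        printf("match rc=%d\n", rc);    /* > 0 on a successful match */

        pcre2_match_data_free(md);
        pcre2_code_free(re);
        return 0;
    }

Build with something like gcc -o demo demo.c -lpcre2-8.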

iapaddler commented 2 years ago

Looking at low-level perf, starting at the bottom and working up. All values are in ns.

Go 1.17 TLS dynamic client (implies SSL processing, plus TLSTX and TLSRX processing) with an HTTP2 preface string:

HTTP2 frames post detection:

Go 1.17 HTTP1 plain client, dynamic (GODEBUG=http2client=0 ldscope ./plainClientDynamic17):

GET request, separate connection
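For reference, numbers like these can be gathered by wrapping the code path of interest with a monotonic clock. A rough sketch, where code_path_under_test() is just a stand-in for whatever is being measured (detection, frame processing, and so on):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    /* Stand-in for the code path being measured. */
    static void code_path_under_test(void) { /* ... */ }

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

    int main(void)
    {
        const int iterations = 100000;
        uint64_t start = now_ns();
        for (int i = 0; i < iterations; i++) {
            code_path_under_test();
        }
        uint64_t elapsed = now_ns() - start;
        printf("avg: %lu ns per call\n",
               (unsigned long)(elapsed / iterations));
        return 0;
    }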

iapaddler commented 2 years ago

Note that there is a possibility that an h2 frame is seen before the preface string. When that happens, the h2 channel is marked as not HTTP. We have seen this in nginx when the h2load client is used (example below), and we found the same behavior with curl. We have not committed to addressing this until we determine whether it is limited to these cases. However, in both cases the responses from nginx are not as expected: both the h2load and curl requests result in far too many requests, and the curl client simply hangs. This occurs with and without nginx being scoped. We will need to resolve that issue before going further with the out-of-order h2 frame.

$ curl -k --http2-prior-knowledge https://localhost/

$ h2load -c 1 -n 1 https://127.0.0.1

Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500] DEBUG: doProtocol(id=-1, fd=3, len=69, src=NETTX, dtyp=BUF) TLS=TRUE PROTO=PENDING (64 bytes)
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0000:  17 03 03 00 40 b5 6f da  92 5f 0e 71 70 bd 55 f1  ....@.o.._.qp.U.
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0010:  0a 9d d8 48 3a 05 8d fc  b6 bc 9d f0 bd 2c ba 20  ...H:........,. 
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0020:  fb bd 47 bb be b8 d5 7e  1d 82 41 8e 57 40 ff f0  ..G....~..A.W@..
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0030:  88 96 27 fd e3 0e cf f6  f5 f5 c1 e5 07 c9 06 0f  ..'.............
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500] DEBUG: doProtocol(id=94376971901872, fd=3, len=40, src=TLSTX, dtyp=BUF) TLS=TRUE PROTO=PENDING (40 bytes)
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0000:  00 00 12 04 00 00 00 00  00 00 03 00 00 00 80 00  ................
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0010:  04 00 01 00 00 00 05 00  ff ff ff 00 00 04 08 00  ................
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0020:  00 00 00 00 7f ff 00 00                           ........
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500] DEBUG: doProtocol(id=-1, fd=3, len=163, src=NETRX, dtyp=BUF) TLS=TRUE PROTO=FALSE (64 bytes)
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0000:  17 03 03 00 78 42 90 34  6c 9f 23 65 9f ac 18 8f  ....xB.4l.#e....
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0010:  4f da a1 cb 77 06 fe f1  90 26 90 e0 1b 47 b0 ce  O...w....&...G..
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0020:  89 73 4d 54 6d 9d f0 20  4f 50 77 e6 45 74 f3 62  .sMTm.. OPw.Et.b
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0030:  cf 17 af ff b6 2f a8 4b  35 6a 9a 1e 6e 4a bd 70  ...../.K5j..nJ.p
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500] DEBUG: doProtocol(id=94376971901872, fd=3, len=96, src=TLSRX, dtyp=BUF) TLS=TRUE PROTO=FALSE (64 bytes)
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0000:  50 52 49 20 2a 20 48 54  54 50 2f 32 2e 30 0d 0a  PRI * HTTP/2.0..
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0010:  0d 0a 53 4d 0d 0a 0d 0a  00 00 0c 04 00 00 00 00  ..SM............
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0020:  00 00 02 00 00 00 00 00  04 3f ff ff ff 00 00 04  .........?......
Scope: nginx: master process nginx(pid:126143): [2022-06-17T11:35:24.200-0500]   0030:  08 00 00 00 00 00 3f ff  00 00 00 00 1d 01 05 00  ......?.........
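For reference, in the dump above the TLSTX payload begins with an h2 SETTINGS frame header (00 00 12 04 ... on stream 0), while the client preface (PRI * HTTP/2.0) only shows up later on the TLSRX side, which is what trips the detection. The sketch below shows what recognizing either signal could look like; the function name and the choice to treat a stream-0 SETTINGS frame as evidence of h2 are illustrative assumptions, not current libscope behavior:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* HTTP/2 client connection preface (RFC 7540, section 3.5). */
    static const unsigned char h2_preface[] = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n";
    #define H2_PREFACE_LEN (sizeof(h2_preface) - 1)

    /*
     * Illustrative only: treat a buffer as h2 if it starts with the client
     * preface, or with a 9-byte frame header whose type is SETTINGS (0x04)
     * on stream 0 -- the "frame before preface" case shown in the dump.
     */
    static bool looks_like_h2(const unsigned char *buf, size_t len)
    {
        if (len >= H2_PREFACE_LEN &&
            memcmp(buf, h2_preface, H2_PREFACE_LEN) == 0) {
            return true;                            /* preface seen first */
        }
        if (len >= 9) {
            unsigned char type = buf[3];
            unsigned int stream = ((buf[5] & 0x7fu) << 24) |
                                  (buf[6] << 16) | (buf[7] << 8) | buf[8];
            if (type == 0x04 && stream == 0) {
                return true;                        /* SETTINGS before preface */
            }
        }
        return false;
    }

    int main(void)
    {
        /* First 9 bytes of the TLSTX payload from the dump above. */
        const unsigned char frame_first[] =
            { 0x00, 0x00, 0x12, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00 };
        printf("frame before preface: %s\n",
               looks_like_h2(frame_first, sizeof(frame_first)) ? "h2" : "not h2");
        return 0;
    }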
iapaddler commented 2 years ago

HTTP2 perf: The focal point for HTTP performance, we suggest, is the overhead attached to the data path. The CPU utilized by the periodic thread, for example to export events and metrics, is not the focal point, since it is not in the execution path of the application. Worst case, the periodic thread consumes a virtual core; if your app has that headroom, the thread does not directly affect it.

Therefore, what matters is a nanosecond-resolution measurement of the overhead incurred in the application data path. The fundamental HTTP2 interaction is the set of h2 frames involved in requests and responses, and the number of frames is quite variable. The libscope code interacts with these frames at the socket level and at the SSL level. To provide context, a client downloading a 32 KB file from gnu.org causes libscope to interact with h2 frames 60 times.

On the server side, we scope nginx and hit it with wget2, which produces 19,107 h2 frames. The average time required to process these frames is 316 ns.
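As a back-of-the-envelope check (assuming the 316 ns average holds across the run), 19,107 frames × 316 ns ≈ 6.0 ms of cumulative data-path overhead for the whole wget2 benchmark.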

jrcheli commented 1 year ago

Just wanted to add another note w.r.t. pcre2: to address the possibility of stack overflows, we added stack switching for some pcre2 calls. See Issue #1170 for more info.

While resolving that issue, we noticed that a server receiving a single curl request ended up calling these wrapper functions well over 700 times. I did not take perf measurements at the time, but this reaffirms that it might be an area to look at more closely. The vast majority of these 700+ calls were made by the reporting thread, but some were in the datapath as well.

To see this on a server, you can use gdb to set a breakpoint on our wrapper functions (pcre2_match_wrapper and regexec_wrapper), then hit the server with an HTTP request from curl.

michalbiesek commented 1 year ago

Just my two cents on @jrcheli's comment above. Regarding the performance impact and PR #1392, it may be worth exploring a case where we allocate the memory only once during the application lifecycle and keep the allocated memory without freeing it. I'm referring to the following part of the code used in pcre2_match_wrapper and regexec_wrapper:

    if ((pcre_stack = scope_malloc(PCRE_STACK_SIZE)) == NULL) {
        scopeLogError("ERROR; regexec_wrapper: scope_malloc");
        return -1;
    }
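For illustration, a minimal sketch of that idea, mirroring the names in the snippet above (get_pcre_stack is a hypothetical helper, and this assumes the first call is not raced by another thread, or is otherwise guarded):

    static char *pcre_stack = NULL;   /* allocated once, intentionally never freed */

    static char *
    get_pcre_stack(void)
    {
        /* Allocate lazily on the first call; reuse the buffer afterwards. */
        if (pcre_stack == NULL) {
            pcre_stack = scope_malloc(PCRE_STACK_SIZE);
            if (pcre_stack == NULL) {
                scopeLogError("ERROR; get_pcre_stack: scope_malloc");
            }
        }
        return pcre_stack;
    }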
jrcheli commented 1 year ago

@michalbiesek, good call. The last commit of Issue #1170 (PR #1392) implements what you were proposing, and was immediately observed to be very effective.