389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

SASL GSSAPI & GSS-SPNEGO performance issue #4506

Closed tiran closed 3 years ago

tiran commented 3 years ago

Issue Description

Today I noticed again that LDAP SASL binds with GSSAPI and GSS-SPNEGO are sometimes slow. The issue is similar to https://pagure.io/freeipa/issue/6656. In about half the cases the SASL bind is fast; in the other cases the operation is more than an order of magnitude slower. I'm seeing the same performance spikes with GSS-SPNEGO. In the fast case GSS-SPNEGO is a bit faster than GSSAPI; in the slow cases it degrades like GSSAPI.

Package Version and Platform:

Steps to Reproduce:

Expected results

Command finishes in around 20ms consistently.

Actual result

Wall clock time for the command is all over the place. Sometimes the command finishes in less than 20ms, sometimes it takes around 250ms, and every now and then it takes more than 500ms.

real    0m0,030s
real    0m0,018s
real    0m0,017s
real    0m0,266s
real    0m0,267s
real    0m0,265s
real    0m0,017s
real    0m0,266s
real    0m0,264s
real    0m0,513s
real    0m0,266s
real    0m0,267s
real    0m0,017s
real    0m0,264s
real    0m0,266s
real    0m0,017s
real    0m0,266s
real    0m0,018s
real    0m0,264s
real    0m0,265s
real    0m0,016s
real    0m0,265s
real    0m0,266s
real    0m0,267s
real    0m0,267s
real    0m0,516s
real    0m0,018s
real    0m0,017s
real    0m0,266s

Additional context

For testing I stopped all IPA services except DS, disabled and removed all IPA plugins that may affect 389-DS performance, and restarted 389-DS. It did not affect performance.

fast bind

    [17/Dec/2020:12:18:24.059835167 +0100] conn=286 fd=109 slot=109 connection from local to /var/run/slapd-IPA-TEST.socket
    [17/Dec/2020:12:18:24.061934129 +0100] conn=286 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:24.064890914 +0100] conn=286 op=0 RESULT err=14 tag=97 nentries=0 wtime=0.000350396 optime=0.002962757 etime=0.003311237, SASL bind in progress
    [17/Dec/2020:12:18:24.065255570 +0100] conn=286 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:24.066715112 +0100] conn=286 op=1 RESULT err=14 tag=97 nentries=0 wtime=0.000065520 optime=0.001463546 etime=0.001527052, SASL bind in progress
    [17/Dec/2020:12:18:24.067034922 +0100] conn=286 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:24.068130645 +0100] conn=286 op=2 RESULT err=0 tag=97 nentries=0 wtime=0.000079306 optime=0.001160701 etime=0.001237869 dn="krbprincipalname=dns/server.ipa.test@ipa.test,cn=services,cn=accounts,dc=ipa,dc=test"
    [17/Dec/2020:12:18:24.068538087 +0100] conn=286 op=3 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"
    [17/Dec/2020:12:18:24.068629931 +0100] conn=286 op=3 RESULT err=0 tag=120 nentries=0 wtime=0.000232724 optime=0.000098566 etime=0.000328895
    [17/Dec/2020:12:18:24.069395429 +0100] conn=286 op=4 UNBIND
    [17/Dec/2020:12:18:24.069450050 +0100] conn=286 op=4 fd=109 closed error - U1

slow bind (delay between op=1 and op=2)

    [17/Dec/2020:12:18:36.625205796 +0100] conn=292 fd=109 slot=109 connection from local to /var/run/slapd-IPA-TEST.socket
    [17/Dec/2020:12:18:36.627115266 +0100] conn=292 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:36.630209431 +0100] conn=292 op=0 RESULT err=14 tag=97 nentries=0 wtime=0.000370102 optime=0.003100261 etime=0.003468232, SASL bind in progress
    [17/Dec/2020:12:18:36.630610597 +0100] conn=292 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:36.632120068 +0100] conn=292 op=1 RESULT err=14 tag=97 nentries=0 wtime=0.000076306 optime=0.001513595 etime=0.001587760, SASL bind in progress
    [17/Dec/2020:12:18:36.881177050 +0100] conn=292 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
    [17/Dec/2020:12:18:36.882648579 +0100] conn=292 op=2 RESULT err=0 tag=97 nentries=0 wtime=0.000174524 optime=0.001485382 etime=0.001657851 dn="krbprincipalname=dns/server.ipa.test@ipa.test,cn=services,cn=accounts,dc=ipa,dc=test"
    [17/Dec/2020:12:18:36.883102483 +0100] conn=292 op=3 EXT oid="1.3.6.1.4.1.4203.1.11.3" name="whoami-plugin"
    [17/Dec/2020:12:18:36.883221013 +0100] conn=292 op=3 RESULT err=0 tag=120 nentries=0 wtime=0.000229879 optime=0.000150909 etime=0.000378423
    [17/Dec/2020:12:18:36.884186770 +0100] conn=292 op=4 UNBIND
    [17/Dec/2020:12:18:36.884207276 +0100] conn=292 op=4 fd=109 closed error - U1

At one point I also enabled tracing to trace function calls into the SASL stack. There is a long delay between the first and second ids_sasl_listmech calls. You can find a full log at https://cheimes.fedorapeople.org/sasl.log

[17/Dec/2020:12:58:38.133255294 +0100] - DEBUG - ids_sasl_server_new - => (vm-171-075.abc.idm.lab.eng.brq.redhat.com)
[17/Dec/2020:12:58:38.145986011 +0100] - DEBUG - ids_sasl_getopt - plugin= option=log_level
[17/Dec/2020:12:58:38.147450940 +0100] - DEBUG - ids_sasl_getopt - plugin= option=auto_transition
[17/Dec/2020:12:58:38.149436807 +0100] - DEBUG - ids_sasl_getopt - plugin= option=mech_list
[17/Dec/2020:12:58:38.157977606 +0100] - DEBUG - ids_sasl_server_new - <=
[17/Dec/2020:12:58:39.043715851 +0100] - DEBUG - ids_sasl_listmech - =>
[17/Dec/2020:12:58:39.045289486 +0100] - DEBUG - ids_sasl_listmech - <=
[17/Dec/2020:12:58:39.841690920 +0100] - DEBUG - ids_sasl_listmech - =>
[17/Dec/2020:12:58:39.843231866 +0100] - DEBUG - ids_sasl_listmech - <=
tiran commented 3 years ago

My gut feeling tells me that the issue could be related to a global lock. I see a lot of activity related to replication and index management in the trace log. ids_sasl_check_bind() is protected by the big lock: pthread_mutex_lock(&(pb_conn->c_mutex)); /* BIG LOCK */

Firstyear commented 3 years ago

I doubt it's the lock you mention here - more likely it's related to https://github.com/389ds/389-ds-base/issues/3032

The change mentioned here had to add a global krb lock to resolve an issue/race with freeing memory, and it was noted at the time that significant performance issues might arise from this change. Certainly this would also have interesting interleaving effects with the conn lock. The conn lock, however, is correctly placed and is per-connection.

It was also discussed that the "long term" solution would be to move to GSSAPI as the API rather than krb directly, but that is an extensive, invasive and complex piece of work.

Firstyear commented 3 years ago

See commit https://github.com/389ds/389-ds-base/commit/ff9387bb97c306ad9c7d430bf58e094333c4f1b3

Firstyear commented 3 years ago

As a workaround you may find that reducing your worker thread count to match the number of available CPUs will increase performance due to a reduction in lock contention.

tiran commented 3 years ago

Thanks @Firstyear!

I have done my best to isolate the system, reduce any interference, and avoid lock contention from external sources.

tiran commented 3 years ago

slapi_ldap_bind performs serialization for mech GSSAPI only. I can reproduce slow bind with mech GSS-SPNEGO, too.

https://github.com/389ds/389-ds-base/blob/0f38410ae6fcd873b4ef9be0d98c2bd26e15c5b9/ldap/servers/slapd/ldaputil.c#L1157-L1159

tiran commented 3 years ago

I just realized that the problem may also be on the client side. SASL GSSAPI authentication takes multiple round trips, and the client may be slow reading or responding to SASL responses. I'm also seeing the same issue with python-ldap, which is just a wrapper for the OpenLDAP and Cyrus SASL client libs.

We need to keep in mind that it might be the client side, or an issue in the SASL library on either the server or the client side.

tiran commented 3 years ago

I noticed that the performance of the SASL bind operation has a peculiar distribution. It's not random at all: slow bind operations are slower by exactly a multiple of 250ms. This cannot be a coincidence. It looks like a sleep(0.250) or a background thread that blocks SASL for a multiple of 250ms.

3ms     329
4ms     1004
5ms     52
6ms     1
253ms   182
254ms   324
255ms   66
256ms   3
263ms   1
502ms   1
503ms   26
504ms   9
505ms   1
753ms   1

2000 SASL GSSAPI binds over LDAPI. I only measured the actual bind operation with a perf counter:

import time

import ldap


def profile(uri, meth):
    """Time a single SASL interactive bind; meth is an ldap.sasl auth object."""
    conn = ldap.initialize(uri)
    conn.whoami_s()  # start connection
    start = time.perf_counter()
    try:
        conn.sasl_interactive_bind_s('', meth)
        return time.perf_counter() - start
    finally:
        conn.unbind()
tiran commented 3 years ago

It's the wakeup timer! I changed the wakeup period from 250 to 25 and the performance of SASL bind operations changed dramatically. The fastest connections are a bit slower, and the slow operations are now quantized in multiples of 25ms.

Is it possible that some SASL calls are either blocked by or executed in the main thread?

--- a/ldap/servers/slapd/daemon.c
+++ b/ldap/servers/slapd/daemon.c
@@ -61,9 +61,9 @@
 #endif /* ENABLE_LDAPI */

 #if defined(LDAP_IOCP)
-#define SLAPD_WAKEUP_TIMER 250
+#define SLAPD_WAKEUP_TIMER 25
 #else
-#define SLAPD_WAKEUP_TIMER 250
+#define SLAPD_WAKEUP_TIMER 25
 #endif

 int slapd_wakeup_timer = SLAPD_WAKEUP_TIMER; /* time in ms to wakeup */

6ms     896
7ms     230
8ms     19
9ms     7
10ms    1
11ms    1
12ms    1
13ms    3
14ms    3
15ms    1
28ms    4
29ms    322
30ms    377
31ms    45
32ms    6
33ms    1
34ms    2
35ms    2
36ms    4
37ms    3
38ms    1
52ms    33
53ms    30
54ms    4
55ms    2
61ms    1
77ms    1
Firstyear commented 3 years ago

Riggghhhttttt, interesting. That would indicate to me that there is a problem with sasl/krb in indicating that the socket is ready for work.

tiran commented 3 years ago

I have tried a different patch that only modifies the timeout of POLL_FN (PR_Poll()). I'm now convinced that the performance issue is related to the core connection handling loop in the slapd_daemon() function:

https://github.com/389ds/389-ds-base/blob/0f38410ae6fcd873b4ef9be0d98c2bd26e15c5b9/ldap/servers/slapd/daemon.c#L989-L1012
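
For readers following along, the loop has roughly this shape. This is a condensed sketch only, not the actual daemon.c code: PR_Poll, PR_MillisecondsToInterval, g_get_shutdown and slapd_wakeup_timer are real names, while the ct/rebuild/handle helpers are illustrative stand-ins for what the linked code does.

    /* Condensed sketch of the main poll loop under discussion. */
    while (!g_get_shutdown()) {
        /* Rebuild the poll set: the listener fds plus every established
         * connection that is not currently busy with an operation. */
        int num_poll = rebuild_fd_set(ct, listener_fds);

        /* Block for up to slapd_wakeup_timer (250ms by default).  An fd
         * that was left out of the set, e.g. a connection that was busy
         * when the set was built, cannot wake this call; it has to wait
         * for the timeout and the next rebuild. */
        PRInt32 nfds = PR_Poll(ct->fd, num_poll,
                               PR_MillisecondsToInterval(slapd_wakeup_timer));

        if (nfds > 0) {
            handle_listeners(ct);   /* accept new connections */
            handle_read_ready(ct);  /* dispatch ready connections to workers */
        }
    }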

tbordaz commented 3 years ago

This is a very nice finding. I have to admit I am a bit puzzled.

Reducing pr_timeout to 0.025s means that if no operation (like sasl_bind) comes in within 0.025s (on an established connection), it will time out, recompute the fd set, and poll again. So the gain is that new incoming connections are taken into account faster. My understanding would be that

3ms     329
4ms     1004
5ms     52
6ms     1
253ms   182
...

~1380 (out of 2000) connections were established on the first poll and their sasl_bind processed. The remaining connections (e.g. the 182) had to wait for the poll timeout to be taken into account. The loss of response time is also because no new operation comes in on the first 1380 connections (after sasl_bind).

Is it possible to test a scenario with 2000 established connections, then 2000 sasl binds, to confirm whether or not pr_timeout is impacting the response time?

At the moment, I have the feeling that reducing the timeout improves the response time of new incoming connections (while established connections are idle after sasl_bind).

The response-time benefit will come at the cost of recomputing the poll set for the established connections.

tiran commented 3 years ago

I chose 25ms for testing only. My hack is not a proper solution.

You can find my trivial test script at https://gist.github.com/tiran/4b56faf5a8b14b9828ef8ba2ad9292ba

Firstyear commented 3 years ago

I think the problem here is that if there is no new work/delay on the POLL_FN and we wait up to pr_timeout, then we are delaying the call to handle_listeners to accept new connections. Then in handle_new_connection, when we call accept_and_configure, we again call PR_Accept with pr_timeout set to the slapd_wakeup_timer.

So POLL_FN is not polling the listeners, only connections in the CT. So that then cascades to

That would certainly align with @tiran's observation that binds are delayed sporadically, because a bind generally is a new connection, and since it would complete "rapidly", the poll for new work would hit the timeout because there are no longer-term connections active in the conntable.

The solution is probably not trivial, but we could consider breaking each listener fd out into its own accept thread and changing its timeout to PR_INTERVAL_NO_TIMEOUT, so that it wakes and accepts immediately on a new connection, while the core select loop handles only the CT and has no interaction with PR_Accept.

A slightly more "palatable" intermediate fix is, rather than one thread per listener, to break accept out into a single thread and have it iterate on just accepts, i.e. move handle_listeners to a dedicated thread in a while (!g_get_shutdown()) loop.
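
Roughly something like the following. This is only a rough sketch of the idea, with illustrative types and helpers (listener_set is hypothetical, handle_listeners stands for the existing accept path); a real change also needs error handling and a clean shutdown path.

    /* Sketch of the "accept in its own thread" idea: poll only the
     * listener fds and accept as soon as one is readable, independent of
     * the connection-table poll loop. */
    static void
    accept_thread(void *arg)
    {
        struct listener_set *listeners = (struct listener_set *)arg;

        while (!g_get_shutdown()) {
            /* PR_INTERVAL_NO_TIMEOUT: wake only when a client connects.
             * A real implementation still needs a way to break out at
             * shutdown, e.g. a finite timeout or closing the listeners. */
            PRInt32 nfds = PR_Poll(listeners->pds, listeners->count,
                                   PR_INTERVAL_NO_TIMEOUT);
            if (nfds > 0) {
                /* accept_and_configure() the new connection(s) and hand
                 * them to the connection table / worker threads. */
                handle_listeners(listeners);
            }
        }
    }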

@tbordaz Thoughts?

tbordaz commented 3 years ago

I think the problem here is that if there is no new work/delay on the POLL_FN and we wait up to pr_timeout, then we are delaying the call to handle_listeners to accept new connections. Then in handle_new_connection, when we call accept_and_configure, we again call PR_Accept with pr_timeout set to the slapd_wakeup_timer.

So POLL_FN is not polling the listeners, only connections in the CT. So that then cascades to

....

@tbordaz Thoughts?

This is exactly what puzzled me. IMHO the listeners are polled: ct->fd (the polled array) is filled with n_tcps, s_tcps, i_unix, then with established & !gettingber connections. Before diving into patch review, do you mind showing me in which cases the listeners are excluded from the poll? Also note that changing only the daemon/poll pr_timeout improves response time, so I am not sure how the pr_timeout in accept_and_configure contributes.

Firstyear commented 3 years ago

@tbordaz Yeah, @tiran reported my hack fix didn't work, so I'm going to re-read to see what I misunderstood. Something seems wrong here, but I think it's because we combine accept + read io events in the one poll loop.

tbordaz commented 3 years ago

slapi_ldap_bind performs serialization for mech GSSAPI only. I can reproduce slow bind with mech GSS-SPNEGO, too.

https://github.com/389ds/389-ds-base/blob/0f38410ae6fcd873b4ef9be0d98c2bd26e15c5b9/ldap/servers/slapd/ldaputil.c#L1157-L1159

@tiran, this portion of code is specific to GSSAPI auth on outgoing connections (replication agreements, chaining). It is not related to the reported concern (response time of incoming GSSAPI auth).

progier389 commented 3 years ago

About the discussion we had on https://github.com/389ds/389-ds-base/pull/4516 about a possible scenario involving the connection lock: I think I approached the problem the wrong way with my first proposal. We need to interrupt the poll when a connection unlocks, but only if its fd was skipped because the lock was busy when building the poll table ==> we should rather flag the connection when skipping its fd; then, if the connection is flagged at unlock time, reset the flag and signal. The change is not very complex, even if lots of mutex unlocks have to be replaced by a call to a new function. And it is better than reducing the timeout (the trade-off about the right timeout value is always a pain ...): if this is really the root cause, we should get back to a few ms delay for all operations ...

Firstyear commented 3 years ago

The problem becomes setting a flag on the connection to record that we previously skipped it because it was busy. That means we either need to acquire the conn lock (but then we'd be able to schedule it), use an extra mutex (which itself could be locked while the connection is trying to unlock and interrogate its status), or use an atomic flag, which poses some race conditions around the connection reading the atomic and unsetting it.

And as you say, it means we would need to find the places where we are releasing the lock to check if we had been skipped. I think there are about 90 places where we unlock the connection ...

If we were to adopt this approach I'd want to do it in a separate PR from the initial threaded accept one.

progier389 commented 3 years ago

The problem becomes setting a flag on the connection to record that we previously skipped it because it was busy. That means we either need to acquire the conn lock (but then we'd be able to schedule it), use an extra mutex (which itself could be locked while the connection is trying to unlock and interrogate its status), or use an atomic flag, which poses some race conditions around the connection reading the atomic and unsetting it.

We cannot acquire the lock (that's the whole point), but IMHO there is no real issue with atomic operations: a fetch_and_and should take care of the test-and-unset case in a single atomic operation. The risk is that we may sometimes signal the wrong poll loop, but that is not really a problem as long as it occurs rarely.
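
A rough sketch of what the flag-and-signal idea could look like; all names here (the flag, the helpers, wakeup_poll_loop) are hypothetical, this is not the actual patch:

    #include <pthread.h>
    #include <stdint.h>

    #define CONN_SKIPPED_BY_POLL 0x1u

    extern void wakeup_poll_loop(void);  /* hypothetical, e.g. a self-pipe write */

    /* Poll-loop side: the connection lock was busy, so its fd was left
     * out of the poll set.  Record that fact without taking the lock. */
    static void
    conn_mark_skipped(uint32_t *c_poll_flags)
    {
        __atomic_fetch_or(c_poll_flags, CONN_SKIPPED_BY_POLL, __ATOMIC_RELEASE);
    }

    /* Connection side: wrapper that replaces a bare mutex unlock.  The
     * fetch_and tests and clears the flag in one atomic step; if the fd
     * had been skipped, signal the poll loop so the connection is
     * re-polled without waiting for the 250ms timeout. */
    static void
    conn_unlock_maybe_signal(pthread_mutex_t *c_mutex, uint32_t *c_poll_flags)
    {
        pthread_mutex_unlock(c_mutex);
        uint32_t prev = __atomic_fetch_and(c_poll_flags, ~CONN_SKIPPED_BY_POLL,
                                           __ATOMIC_ACQ_REL);
        if (prev & CONN_SKIPPED_BY_POLL) {
            wakeup_poll_loop();
        }
    }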

And as you say, it means we would need to find the places where we are releasing the lock to check if we had been skipped. I think there are about 90 places where we unlock the connection ...

If we were to adopt this approach I'd want to do it in a separate PR from the initial threaded accept one.

Yes, a separate PR would be due (anyway, unless I am mistaken, the threaded accept PR is already merged and closed! ;-) )

Firstyear commented 3 years ago

The problem becomes setting a flag on the connection to record that we previously skipped it because it was busy. That means we either need to acquire the conn lock (but then we'd be able to schedule it), use an extra mutex (which itself could be locked while the connection is trying to unlock and interrogate its status), or use an atomic flag, which poses some race conditions around the connection reading the atomic and unsetting it.

We cannot acquire the lock (that's the whole point), but IMHO there is no real issue with atomic operations: a fetch_and_and should take care of the test-and-unset case in a single atomic operation.

My concern is that atomics aren't free. In the past few years we've certainly made huge improvements to thread-safety/memory-safety by swapping unprotected int operations for combinations of mutexes/atomics, but that always has a cost. So I'm sometimes wary of adding more if there could be a different way to solve this. I.e. I think if we used epoll/mio/tokio instead, we would not need this, since it would be properly event driven at that point.
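
For comparison, this is what the event-driven shape looks like with epoll (Linux-only); a minimal sketch with the accept/dispatch details elided, not a proposal for the actual code:

    #include <sys/epoll.h>

    /* Minimal sketch of the event-driven alternative: epoll_wait returns
     * as soon as any registered fd becomes ready, so there is no fixed
     * wakeup timer to tune and no per-iteration rebuild of the fd set.
     * listen_fd is assumed to be a bound, listening socket. */
    static void
    event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event ready[64];
        for (;;) {
            int n = epoll_wait(epfd, ready, 64, -1);  /* block, no timeout */
            for (int i = 0; i < n; i++) {
                if (ready[i].data.fd == listen_fd) {
                    /* accept() the new connection, then EPOLL_CTL_ADD it */
                } else {
                    /* hand the ready connection fd to a worker thread */
                }
            }
        }
    }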

The risk is that we may sometimes signal the wrong poll loop, but that is not really a problem as long as it occurs rarely.

And as you say, it means we would need to find the places where we are releasing the lock to check if we had been skipped. I think there are about 90 places where we unlock the connection ...

If we were to adopt this approach I'd want to do it in a separate PR from the initial threaded accept one.

Yes, a separate PR would be due (anyway, unless I am mistaken, the threaded accept PR is already merged and closed! ;-) )

No, the threaded accept PR is still open :) I'm going to do a cleanup of it today and resubmit just to ensure it's good to be merged.

mreynolds389 commented 3 years ago

We have a major problem: this broke LDAPI --> the installer and CLI tools now hang indefinitely. Please fix this soon, or we will have to revert it.

Firstyear commented 3 years ago

@mreynolds389 Please revert it, I can't reproduce this so it may take me some time to investigate and solve. I will re-submit once I can reproduce and have a fix. Are there more details about the hang?

mreynolds389 commented 3 years ago

I took a stack trace:

Here are the only two threads of interest during the hang:

Thread 26 (Thread 0x7f9d9fdff700 (LWP 1639111)):
#0  0x00007f9df89a3aaf in poll () from target:/lib64/libc.so.6
#1  0x00007f9df867d416 in poll (__timeout=250, __nfds=2, __fds=0x7f9d9fdfcfc0) at /usr/include/bits/poll2.h:46
#2  _pr_poll_with_poll (pds=0x7f9da9a02000, npds=npds@entry=2, timeout=timeout@entry=250) at ../../.././nspr/pr/src/pthreads/ptio.c:4227
#3  0x00007f9df8680e29 in PR_Poll (pds=<optimized out>, npds=npds@entry=2, timeout=timeout@entry=250) at ../../.././nspr/pr/src/pthreads/ptio.c:4647
#4  0x0000555d51d61739 in accept_thread (vports=0x7ffd02f0a4f0) at ldap/servers/slapd/daemon.c:838
#5  0x00007f9df8682554 in _pt_root (arg=0x7f9df440ea00) at ../../.././nspr/pr/src/pthreads/ptthread.c:201
#6  0x00007f9df8614432 in start_thread () from target:/lib64/libpthread.so.0
#7  0x00007f9df89ae913 in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7f9df792a2c0 (LWP 1639086)):
#0  0x00007f9df89a3aaf in poll () from target:/lib64/libc.so.6
#1  0x00007f9df867d416 in poll (__timeout=250, __nfds=1, __fds=0x7ffd02f0a050) at /usr/include/bits/poll2.h:46
#2  _pr_poll_with_poll (pds=0x7f9daf4ed240, npds=npds@entry=1, timeout=timeout@entry=250) at ../../.././nspr/pr/src/pthreads/ptio.c:4227
#3  0x00007f9df8680e29 in PR_Poll (pds=<optimized out>, npds=npds@entry=1, timeout=timeout@entry=250) at ../../.././nspr/pr/src/pthreads/ptio.c:4647
#4  0x0000555d51d60355 in slapd_daemon (ports=<optimized out>) at ldap/servers/slapd/daemon.c:1469
#5  0x0000555d51d53f7f in main (argc=5, argv=0x7ffd02f0a918) at ldap/servers/slapd/main.c:1119

I'll try and do some more testing tonight if I get time...

Firstyear commented 3 years ago

What test produced the issue?

mreynolds389 commented 3 years ago

What test produced the issue?

Running dscreate :-) Then I tried dsconf and it also just hung. It appears to be LDAPI specific... I'm rebuilding the server now to confirm the behavior. I should have confirmation soon...

mreynolds389 commented 3 years ago

dscreate hangs when it tries to bind as root:

DEBUG: open(): Connecting to uri ldapi://%2Fvar%2Frun%2Fslapd-localhost.socket
DEBUG: Using dirsrv ca certificate /etc/dirsrv/slapd-localhost
DEBUG: Using external ca certificate /etc/dirsrv/slapd-localhost
DEBUG: Using external ca certificate /etc/dirsrv/slapd-localhost
DEBUG: Using /etc/openldap/ldap.conf certificate policy
DEBUG: ldap.OPT_X_TLS_REQUIRE_CERT = 2
DEBUG: open(): Using root autobind ...

Same thing for dsconf. The server is not detecting/accepting the LDAPI connection.

tbordaz commented 3 years ago

a4a53e1e2..f3bedfda4 Issue 4506 - BUG - fix oob alloc for fds (#4555)
9015bff23..e4f282e1a Issue 4506 - Temporary fix for io issues (#4516)

mreynolds389 commented 3 years ago

1e3f32dc8..fa46922b6 389-ds-base-1.4.4 -> 389-ds-base-1.4.4
f8f616b6a..ddc28036c 389-ds-base-1.4.3 -> 389-ds-base-1.4.3