OpenIdentityPlatform / OpenAM

OpenAM is an open access management solution that includes Authentication, SSO, Authorization, Federation, Entitlements and Web Services Security.
https://www.openidentityplatform.org/openam

SessionConstraint may cause a hang #681

Closed sp193 closed 9 months ago

sp193 commented 9 months ago

Describe the bug
In a new production system that we are setting up for our customer, AM appears to hang under specific conditions. The exact conditions that cause this hang are not clear, but I need to fix it. I will write down what I know about it in this report.

For now at least, I believe it is related to SessionConstraint: I caught a never-ending loop in the logs, related to calls to refresh(false) within DestroyOldestAction. The transaction ID remains the same, so I believe it is the same request that never ends. Since this runs in Kubernetes, I could not figure out how to attach a debugger to it, so regrettably nothing more concrete can be determined yet. I will explain below why I think this is the cause, and I will paste the relevant log sections in another comment.

To Reproduce
Steps to reproduce the behavior:

  1. Have 2FA and session constraints enabled. I use the "Destroy oldest" strategy and limited the number of sessions to 1. I do not know whether 2FA matters.
  2. Log in and out of the system.
  3. After some time, AM may show excessive memory use and/or become unresponsive, and it will likely crash due to OOM.

Expected behavior
OpenAM should enforce session constraints without issue.

Screenshots
Stack trace of one of the threads that triggers the never-ending reading of tokens (attached screenshot): am-hang-infinite-session-load

In another case (not in the same run), there are too many instances of LDAPConnectionFactory and related classes, matched by an equally large number of instances of CachedConnectionPool (attached screenshot): 69K lots - LDAP

Personally, I think the memory issue is a symptom of the actual problem rather than the problem itself. I do not have strong evidence for this, but I remember seeing connection pools getting created inside the loop; if that happens fast enough, perhaps they are created faster than OpenDJ can handle.


Additional context

I commented out this loop (which was added here), like this, and the problem appears to have gone away:

...
        // Step 3: checking the constraints
        if (sessions != null && SystemProperties.getAsBoolean("org.openidentityplatform.openam.cts.quota.exhaustion.enabled", true)) {
            sessions.remove(internalSession.getSessionID().toString());
            //while (sessions.size() >= quota) {
                reject = getQuotaExhaustionAction().action(internalSession, sessions);
                //if (reject)
                    //break;
            //}
...
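For reference, this is how the same block reads when the loop is not commented out (just the snippet above with the comment markers removed); the annotation marks the condition that never becomes false in the failing case:

        if (sessions != null && SystemProperties.getAsBoolean("org.openidentityplatform.openam.cts.quota.exhaustion.enabled", true)) {
            sessions.remove(internalSession.getSessionID().toString());
            // The loop only ends when the session map shrinks below the quota
            // or when the action asks for the new session to be rejected.
            while (sessions.size() >= quota) {
                reject = getQuotaExhaustionAction().action(internalSession, sessions);
                if (reject)
                    break;
            }
        }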

However, I will only know for sure whether this matters tomorrow. It usually happens whenever my customer is at work; I do not know why, but I assume it is caused by more sessions getting created and destroyed.

Within DestroyOldestAction, the action() method now looks like this:

    public boolean action(InternalSession is, Map<String, Long> sessions) {
        long smallestExpTime = Long.MAX_VALUE;
        String oldestSessionID = null;
        for (Map.Entry<String, Long> entry : sessions.entrySet()) 
            if (!StringUtils.equals(is.getSessionID().toString(), entry.getKey())){
                try {
                    Session session = new Session(new SessionID(entry.getKey()));
                    session.refresh(false);
                    long expTime = session.getTimeLeft();
                    if (expTime <= smallestExpTime) {
                        smallestExpTime = expTime;
                        oldestSessionID = entry.getKey();
                    }
                } catch (SessionException ssoe) {}
            }
        if (oldestSessionID != null) { 
            destroy(oldestSessionID,sessions);
        }
        return false;
    }

There is a call to refresh(), as seen inside the stack trace above.
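To make the failure mode concrete, the following is a minimal, self-contained sketch (plain Java with hypothetical names, not the real OpenAM classes) that combines the loop from SessionConstraint with an action that, like the method above, swallows lookup failures and returns false. Running it spins forever, which is exactly the hang described in this report:

    import java.util.HashMap;
    import java.util.Map;

    public class QuotaLoopSketch {

        // Stand-in for DestroyOldestAction.action(): pick the oldest candidate,
        // but if every lookup fails, destroy nothing and do not reject.
        static boolean action(Map<String, Long> sessions) {
            String oldest = null;
            long smallest = Long.MAX_VALUE;
            for (Map.Entry<String, Long> entry : sessions.entrySet()) {
                try {
                    long timeLeft = lookup(entry.getKey()); // stands in for session.refresh(false)
                    if (timeLeft <= smallest) {
                        smallest = timeLeft;
                        oldest = entry.getKey();
                    }
                } catch (IllegalStateException stale) {
                    // swallowed, just like the empty catch in the real method
                }
            }
            if (oldest != null) {
                sessions.remove(oldest); // stands in for destroy(oldestSessionID, sessions)
            }
            return false;
        }

        // Pretend the token behind every remaining session ID was already removed from the CTS.
        static long lookup(String sessionId) {
            throw new IllegalStateException("token no longer exists: " + sessionId);
        }

        public static void main(String[] args) {
            Map<String, Long> sessions = new HashMap<>();
            sessions.put("stale-session-id", 0L);
            int quota = 1;
            // Same shape as the loop in SessionConstraint: this never terminates,
            // because action() neither shrinks the map nor returns true.
            while (sessions.size() >= quota) {
                if (action(sessions)) {
                    break;
                }
            }
        }
    }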

If we go back further (before 9f368141), this same method looked like this at some point:

if (oldestSessionID != null) {
            SessionID sessID = new SessionID(oldestSessionID);
            try {
                Session s = sessionCache.getSession(sessID);
                debug.warning("destroy {}", sessID);
                s.destroySession(s);
            } catch (SessionException e) {
                if (debug.messageEnabled()) {
                    debug.message("Failed to destroy the next expiring session.", e);
                }
                // deny the session activation request
                // in this case
                return true;
            }

The exception flow may occur if the token no longer exists. The Session.refresh() method would throw this exception:

        try {
            RestrictedTokenContext.doUsing(activeContext,
                    new RestrictedTokenAction() {
                        public Object run() throws Exception {
                            doRefresh(reset);
                            return null;
                        }
                    });

        } catch (Exception e) {
            sessionCache.removeSID(sessionID);
            if (sessionDebug.messageEnabled()) {
                sessionDebug.message("session.Refresh " + "Removed SID:"
                        + sessionID);
            }
            throw new SessionException(e);
        }

This means that no token can be removed. Because the SessionException is swallowed by the empty catch block in action(), oldestSessionID stays null, destroy() is never called, and the method returns false. However, the SessionConstraint class now has a loop that will never exit unless at least one session is destroyed (or the action rejects the request).

I think the original logic had no loop precisely because of this possibility: the oldest session cannot be destroyed because it was already destroyed. In the original version, the creation of a new session would also be blocked in that case. But because of the code change in 36677ca, the new session may now always be created (provided the loop in SessionConstraint does exit). I think that could be permissible if we can assume the mechanism always works, i.e. either the oldest session is destroyed or there is simply no valid session left to destroy. If not, then we may need to revert that change too.

But because there are no comments explaining why these changes were made in the first place, I do not know how to suggest a proper fix.
No longer blocking session creation if the old session cannot be destroyed: https://github.com/OpenIdentityPlatform/OpenAM/blame/02b7dd00e64302c7efe90a35fb43946c346dbb8c/openam-core/src/main/java/org/forgerock/openam/session/service/DestroyOldestAction.java#L82
Why exactly do we need this loop? https://github.com/OpenIdentityPlatform/OpenAM/blame/28f00ee2f3601c5b2e22388f4b786abd5aafae78/openam-core/src/main/java/com/iplanet/dpro/session/service/SessionConstraint.java#L158
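For illustration only, here is one hypothetical direction (I do not know whether it matches the intent of the original changes, and it is not the actual fix): let action() drop the entries it could not refresh, so that the loop in SessionConstraint can make progress even when the oldest session's token is already gone:

    public boolean action(InternalSession is, Map<String, Long> sessions) {
        long smallestExpTime = Long.MAX_VALUE;
        String oldestSessionID = null;
        // Hypothetical sketch: collect the IDs whose tokens no longer exist,
        // instead of silently ignoring the SessionException.
        java.util.List<String> staleSessionIDs = new java.util.ArrayList<>();
        for (Map.Entry<String, Long> entry : sessions.entrySet()) {
            if (StringUtils.equals(is.getSessionID().toString(), entry.getKey())) {
                continue;
            }
            try {
                Session session = new Session(new SessionID(entry.getKey()));
                session.refresh(false);
                long expTime = session.getTimeLeft();
                if (expTime <= smallestExpTime) {
                    smallestExpTime = expTime;
                    oldestSessionID = entry.getKey();
                }
            } catch (SessionException ssoe) {
                staleSessionIDs.add(entry.getKey());
            }
        }
        // Remove stale entries after the iteration (removing inside the for-each
        // would throw ConcurrentModificationException), so that sessions.size()
        // can eventually drop below the quota in SessionConstraint's loop.
        for (String stale : staleSessionIDs) {
            sessions.remove(stale);
        }
        if (oldestSessionID != null) {
            destroy(oldestSessionID, sessions);
        }
        return false;
    }

Again, this is only meant to show what "making progress" could look like; whether it is the right behaviour depends on the original intent behind the loop.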

@vharseko, would you be able to explain why these changes were made this way, so that I can help suggest a fix? Or do you already know what to do? Thank you.

sp193 commented 9 months ago

Within the CoreSystem log, this loop appears repeatedly. The only difference between iterations is the suffix of the transactionId attribute of the query, which makes me believe that one and the same request is stuck.

tokenDataLayer:12/02/2023 12:10:54:134 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Queuing Task ReadTask: -8136762284537127581
tokenDataLayer:12/02/2023 12:10:54:146 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Queuing Task QueryTask: TokenFilter: Filter: [coreTokenMultiString01 eq "AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*"] Attributes: 
tokenDataLayer:12/02/2023 12:10:54:149 PM UTC: Thread[CTSWorkerPool-2,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Creating Embedded Factory:
URL: [ldap://opendj:50389]
Max Connections: 10
Heartbeat: 10
Operation Timeout: 10
tokenDataLayer:12/02/2023 12:10:54:152 PM UTC: Thread[CTSWorkerPool-2,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Query: matched 0 results
Search Request: SearchRequest(name=ou=famrecords,ou=openam-session,ou=tokens,dc=openam,dc=forgerock,dc=org, scope=sub, dereferenceAliasesPolicy=never, sizeLimit=0, timeLimit=0, typesOnly=false, filter=(&(coreTokenMultiString01=AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.\2AAAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..\2A)(objectClass=frCoreToken)), attributes=[*, etag], controls=[TransactionIdControl(oid=1.3.6.1.4.1.36733.2.1.5.1, transactionId=3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826/153959)])
Filter: (&(coreTokenMultiString01=AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.\2AAAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..\2A)(objectClass=frCoreToken))
Result: Result(resultCode=Success, matchedDN=, diagnosticMessage=, referrals=[], controls=[])
tokenDataLayer:12/02/2023 12:10:54:153 PM UTC: Thread[CTSWorkerPool-2,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Query: Matched 0, some Tokens are shown below:

Within the Session log, there is this loop, which bears the same transactionId:

amCoreTokenService:12/02/2023 12:10:29:561 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Read: queued -8136762284537127581
amCoreTokenService:12/02/2023 12:10:29:564 PM UTC: Thread[CTSWorkerPool-0,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-0]
CTS Async: Task Processor: process Task org.forgerock.openam.sm.datalayer.impl.SeriesTaskExecutor$AuditRequestContextPropagatingTask@62508546
amCTSAsync:12/02/2023 12:10:29:566 PM UTC: Thread[CTSWorkerPool-0,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Received: results: <null>
amCTSAsync:12/02/2023 12:10:29:567 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Results: <null>
amCoreTokenService:12/02/2023 12:10:29:569 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Read: no Token found for -8136762284537127581
amCoreTokenService:12/02/2023 12:10:29:570 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Token -8136762284537127581 did not exist
amCoreTokenService:12/02/2023 12:10:29:572 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Query: TokenFilter: Filter: [coreTokenMultiString01 eq "AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*"] Attributes: 
amCoreTokenService:12/02/2023 12:10:29:573 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Query: queued with Filter: TokenFilter: Filter: [coreTokenMultiString01 eq "AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*"] Attributes: 
amCoreTokenService:12/02/2023 12:10:29:576 PM UTC: Thread[CTSWorkerPool-5,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-0]
CTS Async: Task Processor: process Task org.forgerock.openam.sm.datalayer.impl.SeriesTaskExecutor$AuditRequestContextPropagatingTask@671a528e
amCTSAsync:12/02/2023 12:10:29:583 PM UTC: Thread[CTSWorkerPool-5,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Received: results: []
amCTSAsync:12/02/2023 12:10:29:585 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS Async: Results: []
amCoreTokenService:12/02/2023 12:10:29:586 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
CTS: Query: returned 0 Tokens with Filter: TokenFilter: Filter: [coreTokenMultiString01 eq "AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*"] Attributes: 
amSession:12/02/2023 12:10:29:588 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
session.Refresh Removed SID:AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*
amSession:12/02/2023 12:10:29:589 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*: LocalOperations selected.
amSession:12/02/2023 12:10:29:590 PM UTC: Thread[http-nio-9090-exec-4,5,main]: TransactionId[3f4a4eee-7166-44fb-925c-8d578f8c94ae-9826]
Local fetch SessionInfo for AQIC5wM2LY4SfcwV282pLmPmxFjnwCrkto2w86_1K7BU9nU.*AAJTSQACMDEAAlNLABQtODEzNjc2MjI4NDUzNzEyNzU4MQACUzEAAA..*
Reset: false

What I understand from the Session log and from studying the code with reference to the stack trace: "CTS: Query: returned 0 Tokens with Filter" is printed because there are no matches for the given token. "Removed SID" is printed by Session, in the refresh() catch block quoted above. The exception that is caught by Session is thrown by LocalOperations.

vharseko commented 9 months ago

Please provide: 1) the version, and 2) the output of jstack PID >> thread.txt (multiple files).

sp193 commented 9 months ago

Thank you for looking into this. Apologies for taking a while to get back to you, but I have been getting increasingly busy at work. Since the only platform I have that can reliably replicate this is a soon-to-be production environment, I can only test the new commit over the weekend (which is very soon). I think your change does look like a way to handle the case where the oldest session can no longer be destroyed.

I made a heap dump before (.hprof), which seems to have the status of all threads. Would this do? https://www.sendspace.com/file/e4be8p As mentioned above, one of the threads stuck in the loop was http-nio-9090-exec-4.

The version of OpenAM is this repository's head, a 14.8.1 snapshot.

sp193 commented 9 months ago

Hi again. It has been 2 days and I think the patch has been working well. Closing as fixed by #682. Thank you.