eclipse-ee4j / eclipselink

Eclipselink project
https://eclipse.dev/eclipselink/
Other
199 stars 169 forks source link

Eclipselink 2.7.x Tracing metadata - bug in method org.eclipse.persistence.internal.helper.ConcurrencyManager.putThreadAsWaitingToAcquireLockForWriting(Thread, String) - We end up with metadata on threads wanting to acquire write locks #2173

Open 99sono opened 4 months ago

99sono commented 4 months ago

Bug Report: Issue with putThreadAsWaitingToAcquireLockForWriting Method in EclipseLink 2.7.x

  1. Problem Description:

    • The implementation of the putThreadAsWaitingToAcquireLockForWriting method in EclipseLink 2.7.x is incorrect.
    • The method adds trace metadata about a thread wanting to acquire a write lock but immediately removes that same trace information.
    • Additionally, the library lacks the corresponding method for removing a thread no longer waiting to acquire a lock for writing.
  2. Affected Method:

    • org.eclipse.persistence.internal.helper.ConcurrencyManager.putThreadAsWaitingToAcquireLockForWriting(Thread, String)
    
    /**
     * Normally this mehtod should only be called from withing the concurrency manager.
     * However the write lock manager while it is building clones also does some while loop waiting
     * to try to acquire a cache key this acquiring logic is not being managed directly inside of the wait manager.
     *
     */
    public void putThreadAsWaitingToAcquireLockForWriting(Thread thread, String methodName) {
        THREADS_TO_WAIT_ON_ACQUIRE_NAME_OF_METHOD_CREATING_TRACE.put(thread, methodName);
        THREADS_TO_WAIT_ON_ACQUIRE_NAME_OF_METHOD_CREATING_TRACE.remove(thread);
    }

    Notice how this method is putting the trace metadata and then immediately removing it right after.

    This how this method is looking like in the inhoud modified 2.6.4 version.

    /**
     * Normally this mehtod should only be called from withing the concurrency manager. However the write lock manager
     * while it is building clones also does some while loop waiting to try to acquire a cache key this acquiring logic
     * is not being managed directly inside of the wait manager.
     *
     * @param methodNamCreatingTheTrace
     *            metadata to help us debug trace leaking. If we start blowing up threads we do not want the traces
     *            created by the current thread to remain.
     */
    public void putCurrentThreadAsWaitingToAcquireLockForWriting(String methodNamCreatingTheTrace) {
        Thread currentThread = Thread.currentThread();
        MAP_THREAD_TO_WAIT_ON_ACQUIRE.put(currentThread, this);
        MAP_THREAD_TO_WAIT_ON_ACQUIRE_NAME_OF_METHOD_CREATING_TRACE.put(currentThread, methodNamCreatingTheTrace);
    }
    
  3. Missing Removal Method:

    • The library lacks the following method for removing a thread no longer waiting to acquire a lock for writing:
      public void removeCurrentThreadNoLongerWaitingToAcquireLockForWriting() {
       Thread currentThread = Thread.currentThread();
       MAP_THREAD_TO_WAIT_ON_ACQUIRE.remove(currentThread);
       MAP_THREAD_TO_WAIT_ON_ACQUIRE_NAME_OF_METHOD_CREATING_TRACE.remove(currentThread);
      }
  4. Efficiency Concerns:

    • The current approach used by callers (such as org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone) to construct the trace string is inefficient.
    • The string stackTraceElement.getClassName() + "." + stackTraceElement.getMethodName() + "(...)" should either be cached in a static variable or computed only once.

Here is a preview of what the method currently looks like:

                // of the concurrency manager that we use for creating the massive log dump
                // to indicate that the current thread is now stuck trying to acquire some arbitrary
                // cache key for writing
                StackTraceElement stackTraceElement = Thread.currentThread().getStackTrace()[1];
                lastCacheKeyWeNeededToWaitToAcquire = toWaitOn;
                lastCacheKeyWeNeededToWaitToAcquire.putThreadAsWaitingToAcquireLockForWriting(currentThread, stackTraceElement.getClassName() + "." + stackTraceElement.getMethodName() + "(...)");
  1. Proposed Solution:
    • Revert to the original implementation used in our old Inhouse Patch Fixed EclipseLink 2.6.x version for adding and removing thread trace information.
    • Consider optimizing the trace string construction to improve efficiency.
  lastCacheKeyWeNeededToWaitToAcquire.putCurrentThreadAsWaitingToAcquireLockForReading(
                        "org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone(Object, ClassDescriptor, CacheKey, AbstractSession)");
  1. How We spotted this bug:

    • We are currently analyzing a new massive dump. This is a completely new pattern of deadlock we are investigating. Different from the MergeManager deadlock for which we have a different open issue. The current reason for the deadlock is still being investigated.

    • In this massive dump we had spotted a thread that had this stack trace

      [ACTIVE] ExecuteThread: '117' for queue: 'weblogic.kernel.Default (self-tuning)'" 
      java.lang.Thread.State: RUNNABLE
      at java.base@11.0.16/java.util.concurrent.ConcurrentHashMap.hashCode(ConcurrentHashMap.java:1302)
      at java.base@11.0.16/java.util.concurrent.ConcurrentHashMap.hashCode(ConcurrentHashMap.java:1302)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:651)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:584)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeFields(ThreadSnapshot.java:836)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeChildren(ThreadSnapshot.java:735)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:662)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:584)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeFields(ThreadSnapshot.java:836)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeChildren(ThreadSnapshot.java:735)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:662)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:584)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeFields(ThreadSnapshot.java:836)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeChildren(ThreadSnapshot.java:735)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:662)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:584)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeFields(ThreadSnapshot.java:836)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeChildren(ThreadSnapshot.java:735)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:662)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.writeVariable(ThreadSnapshot.java:584)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.access$100(ThreadSnapshot.java:40)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot$1$1.variable(ThreadSnapshot.java:408)
      at com.intergral.fusionreactor.plugin.jvmti.api.JvmtiApi.listVariables(Native Method)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot$1.frame(ThreadSnapshot.java:322)
      at com.intergral.fusionreactor.plugin.jvmti.api.JvmtiApi.listStack(Native Method)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshot.log(ThreadSnapshot.java:258)
      at com.intergral.fusionreactor.plugin.jvmti.thread.snapshot.ThreadSnapshotHandler.createThreadSnapshot(ThreadSnapshotHandler.java:383)
      at com.intergral.fusionreactor.plugin.logtracker.LogTrackerComponent.log(LogTrackerComponent.java:171)
      at com.intergral.fusionreactor.plugin.logtracker.LogTrackerImpl.fireLogEvent(LogTrackerImpl.java:58)
      at com.intergral.fusionreactor.plugin.logtracker.LogTrackerImpl.error(LogTrackerImpl.java:24)
      at java.mixins.fusionreactor.api.log.LogUtils.trackLogError(LogUtils.java:41)
      at com.intergral.fusionreactor.agent.pointcuts.logtracker.SLF4JPointCut$LoggingLevels$1.doLog(SLF4JPointCut.java:206)
      at com.intergral.fusionreactor.agent.pointcuts.logtracker.SLF4JPointCut.getAdvice(SLF4JPointCut.java:179)
      at com.intergral.fusionreactor.agent.advice.factory.AdviceFactory.getAdvice(AdviceFactory.java:64)
      at com.intergral.fusionreactor.agent.advice.factory.AdviceFactory.invoke(AdviceFactory.java:42)
      at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java)
      at com.whatever.eclipselink.Slf4jSessionLog.basicLog(Slf4jSessionLog.java:134)
      at com.whatever..eclipselink.Slf4jSessionLog.log(Slf4jSessionLog.java:109)
      at org.eclipse.persistence.logging.AbstractSessionLog.log(AbstractSessionLog.java:579)
      
      NOTE: This is where it gets interesting:
      This thread WriteLockManager.acquireLocksForClone has no tracing metadata in the 
      quote :Concurrency manager - Page 02 start - information about threads waiting to acquire (write/deferred) cache keys 
      This is because of the incorret implementation of the org.eclipse.persistence.internal.helper.ConcurrencyManager.putThreadAsWaitingToAcquireLockForWriting(Thread, String)
      
      at org.eclipse.persistence.internal.helper.ConcurrencyUtil.currentThreadIsStuckForSomeTimeProduceTinyLogMessage(ConcurrencyUtil.java:567)
      at org.eclipse.persistence.internal.helper.ConcurrencyUtil.determineIfReleaseDeferredLockAppearsToBeDeadLocked(ConcurrencyUtil.java:235)
      at org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone(WriteLockManager.java:171)
      
      at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1056)
      at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1014)
      at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:211)
      at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getFromIdentityMap(UnitOfWorkIdentityMapAccessor.java:139)
      at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4093)
      at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4016)
      at org.eclipse.persistence.mappings.CollectionMapping.buildElementUnitOfWorkClone(CollectionMapping.java:313)
      at org.eclipse.persistence.mappings.CollectionMapping.buildElementClone(CollectionMapping.java:326)
      at org.eclipse.persistence.internal.queries.ContainerPolicy.addNextValueFromIteratorInto(ContainerPolicy.java:217)
      at org.eclipse.persistence.mappings.CollectionMapping.buildCloneForPartObject(CollectionMapping.java:228)
      at org.eclipse.persistence.internal.indirection.UnitOfWorkQueryValueHolder.buildCloneFor(UnitOfWorkQueryValueHolder.java:62)
      at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiateImpl(UnitOfWorkValueHolder.java:177)
      at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiate(UnitOfWorkValueHolder.java:238)
      at org.eclipse.persistence.internal.indirection.DatabaseValueHolder.getValue(DatabaseValueHolder.java:97)
      at org.eclipse.persistence.indirection.IndirectList.buildDelegate(IndirectList.java:275)
      at org.eclipse.persistence.indirection.IndirectList.getDelegate(IndirectList.java:458)
      at org.eclipse.persistence.indirection.IndirectList.toArray(IndirectList.java:864)
      at java.base@11.0.16/java.util.Collections$UnmodifiableCollection.toArray(Collections.java:1037)
      at java.base@11.0.16/java.util.ArrayList.addAll(ArrayList.java:702)
      at com.whatever.entity.WhateverEntity.getSomeWhateverOneToManyField(WhateverEntity.java:616)
      
      ... rest of the stack trace is censored. 

As you can see in the stack trace above , the thread itself that is generative the massive dump was supposed to have create some tracing metadata. The metadata was nowhere to be found.

7. The Metadata Information Might Also Be Incorrect - Write Lock Manager Is Trying to Acquire a READ Lock Key, Not a WRITE Lock Key:

public CacheKey acquireLockAndRelatedLocks(Object objectForClone, Map lockedObjects, Map refreshedObjects, CacheKey cacheKey, ClassDescriptor descriptor, AbstractSession cloningSession) {
    if (!refreshedObjects.containsKey(objectForClone) && cloningSession.isConsideredInvalid(objectForClone, cacheKey, descriptor)) {
        return cacheKey;
    }
    // Attempt to get a read-lock; null is returned if it cannot be read-locked.
    if (cacheKey.acquireReadLockNoWait()) {
        if (cacheKey.getObject() == null) {
            // This will be the case for deleted objects, NoIdentityMap, and aggregates.
            lockedObjects.put(objectForClone, cacheKey);
        } else {
            objectForClone = cacheKey.getObject();
            if (lockedObjects.containsKey(objectForClone)) {
                // This check is for loss of identity; the original check in checkAndLockObject() will short-circuit in the usual case.
                cacheKey.releaseReadLock();
                return null;
            }
            // Store the locked cache key for release later.
            lockedObjects.put(objectForClone, cacheKey);
        }
        return traverseRelatedLocks(objectForClone, lockedObjects, refreshedObjects, descriptor, cloningSession);
    } else {
        // Return the cache key that could not be locked.
        return cacheKey;
    }
}

As we can see from the method above, we are using acquireReadLockNoWait. This choice aligns with more accurate metadata.

  1. Additional Notes: Addressing this issue will enhance performance and assist in our ongoing analysis of production issues. Thank you for your attention to this matter.

Note 1: We will provide Oracle with the in-house manipulated 2.6.x source code. This will allow them to compare the implementation of metadata acquisition and calls to the following method:

org.eclipse.persistence.internal.helper.ConcurrencyManager.removeCurrentThreadNoLongerWaitingToAcquireLockForWriting()

against the current 2.7.x version.

Note 2: I am also attaching a snippet of the manipulated 2.6.x classes for WriteLockManager and ConcurrencyManager.java. This snippet will facilitate the analysis of the mentioned methods.

Please note that not all ongoing fixes in the 2.7.x version have been back-merged into our old, patched 2.6.x code. Back-merging recent developments from 2.7.x to 2.6.x is currently a low-priority task. Nevertheless, the attached code is relevant in the context of this defect.

WriteLockManager_2_6_4.txt ConcurrencyManager_2_6_4.txt

99sono commented 4 months ago

This issue is now documented under a service request. Thanks a lot.

99sono commented 3 months ago

Side note: The 2.7.15 tag appears to address a defect related to the ConcurrencyManager. You can find the relevant changes in the following commits:

  1. Update ConcurrencyManager implementation to be in sync with master
  2. Pull request #1347

Additionally, here's the direct link to the specific commit: Commit details

  1. The code in the Write Lock manager looks to have been modified: [2.7.15 write lock manager with modified acquireLocksForClone] (https://github.com/eclipse-ee4j/eclipselink/blob/2.7.15/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/helper/WriteLockManager.java)

I am some hard time following the changes on the method. the changes look bigger than i anticipated. I see that the method continues to do a:

putThreadAsWaitingToAcquireLockForWriting

Whereas I would have expected a

lastCacheKeyWeNeededToWaitToAcquire.putCurrentThreadAsWaitingToAcquireLockForReading

As I had explained in point 7.

Also the new piece of code doing:

 toWaitOnLock.lock()

And later

 toWaitOnLock.unlock();

Looks like a radical change to me in the method implementation.

I cannot judge if that code is actually correct or not.

All I can see I was expecting the method implementation to look more like this:

 public Map acquireLocksForClone(Object objectForClone, ClassDescriptor descriptor, CacheKey cacheKey, AbstractSession cloningSession) {
        // TRK-19750 - basic variable initialization to be able to do the
        // determineIfReleaseDeferredLockAppearsToBeDeadLocked
        final Date whileStartDate = new Date();
        final Thread currentThread = Thread.currentThread();
        DeferredLockManager lockManager = ConcurrencyManager.getDeferredLockManager(currentThread);
        HackingEclipseReadLockManager readLockManager = ConcurrencyManager.getReadLockManager(currentThread);

        boolean successful = false;
        IdentityHashMap lockedObjects = new IdentityHashMap();
        IdentityHashMap refreshedObjects = new IdentityHashMap();

        CacheKey lastCacheKeyWeNeededToWaitToAcquire = null;
        try {
            // if the descriptor has indirection for all mappings then wait as there will be no deadlock risks
            CacheKey toWaitOn = acquireLockAndRelatedLocks(objectForClone, lockedObjects, refreshedObjects, cacheKey, descriptor, cloningSession);
            int tries = 0;
            while (toWaitOn != null) {// loop until we've tried too many times.

                for (Iterator lockedList = lockedObjects.values().iterator(); lockedList.hasNext();) {
                    ((CacheKey)lockedList.next()).releaseReadLock();
                    lockedList.remove();
                }

                // TRK-19750 - populate the static hash map
                // of the concurrenyc manager that we use for creating the massive log dump
                // to indicate that the current thread is now stuck trying to acquire some arbitrary
                // cache key for writing
                lastCacheKeyWeNeededToWaitToAcquire = toWaitOn;
                lastCacheKeyWeNeededToWaitToAcquire.putCurrentThreadAsWaitingToAcquireLockForReading(
                        "org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone(Object, ClassDescriptor, CacheKey, AbstractSession)");

                // TRK-19750 - Since we know this one of those methods that can appear in the dead locks
                // we threads frozen here forever inside of the wait that used to have no timeout
                // we will now always check for how long the current thread is stuck in this while loop going nowhere
                // using the exact same approach we have been adding to the concurrency manager
                HackingEclipseHelperUtil.SINGLETON.determineIfReleaseDeferredLockAppearsToBeDeadLocked(toWaitOn,
                        whileStartDate, lockManager, readLockManager,
                        WRITE_LOCK_MANAGER_IS_WILLING_TO_ALLOW_INTERRUPTED_EXCEPTION_TO_BE_FIRED_UP_IF_CONFIGURATION_WOULD_ALLOW_ID_TRUE);

                synchronized (toWaitOn) {
                    try {
                        if (toWaitOn.isAcquired()) {//last minute check to insure it is still locked.
                            // TRK-19750
                            // this is the famous WriteLockManager.acquireLocksForClone(WriteLockManager.java:92)
                            // being one of the spots where threads trying build objects can get stuck forever
                            // commenting out wait without a timeout
                            // if the thread that has the lock with write permissions is in a dead lock
                            // then we are not coming out
                            // toWaitOn.wait();// wait for lock on object to be released

                            // TRK-19750
                            // at wait with timout like everywhere else waiting without timeout is always wrong
                            toWaitOn.wait(10000l);
                        }
                    } catch (InterruptedException ex) {
                        // Ignore exception thread should continue.
                    }
                }
                Object waitObject = toWaitOn.getObject();
                // Object may be null for loss of identity.
                if (waitObject != null) {
                    cloningSession.checkAndRefreshInvalidObject(waitObject, toWaitOn, cloningSession.getDescriptor(waitObject));
                    refreshedObjects.put(waitObject, waitObject);
                }
                toWaitOn = acquireLockAndRelatedLocks(objectForClone, lockedObjects, refreshedObjects, cacheKey, descriptor, cloningSession);
                if ((toWaitOn != null) && ((++tries) > MAXTRIES)) {
                    // If we've tried too many times abort.
                    throw ConcurrencyException.maxTriesLockOnCloneExceded(objectForClone);
                }
            }
            successful = true;//successfully acquired all locks
        } catch (InterruptedException exception) {
            // TRK-19750 - if determineIfReleaseDeferredLockAppearsToBeDeadLocked is blowing up a thread stuck for too
            // long
            // run the lock of freeing up locks acquired by the thread
            // NOTE: we would be tempted to do this commented code bellow
            // cacheKey.releaseAllLocksAquiredByThread(lockManager, readLockManager);
            // throw ConcurrencyException.waitFailureOnClientSession(exception);
            // Instead what we do is just mimic the vanila behavior we have for the interrupted exception inside of the
            // wait
            // we must assume this is correct behavior
            throw ConcurrencyException.maxTriesLockOnCloneExceded(objectForClone);
        } finally {
            // TRK-19750 - remove from the static hash map
            // of the concurrency manager that we use for creating the massive log dump
            // any information we may have added that this thread was strugling to acquire any particular
            // cache key. The current thread is out of the wait to acquire logic now so we can consider the thread
            // as not no longer being stuck
            if (lastCacheKeyWeNeededToWaitToAcquire != null) {
                lastCacheKeyWeNeededToWaitToAcquire.removeCurrentThreadNoLongerWaitingToAcquireLockForReading();
            }

            if (!successful) {//did not acquire locks but we are exiting
                for (Iterator lockedList = lockedObjects.values().iterator(); lockedList.hasNext();) {
                    ((CacheKey)lockedList.next()).releaseReadLock();
                    lockedList.remove();
                }
            }
        }
        return lockedObjects;
    }

I would assume the changes in the: https://github.com/eclipse-ee4j/eclipselink/blob/2.7.15/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/helper/WriteLockManager.java

Are correct but I have some hard time seeing that is the case.

rfelcman commented 1 month ago

Sorry but is it this issue still valid? Because methods public void putThreadAsWaitingToAcquireLockForWriting(Thread thread, String methodName) https://github.com/eclipse-ee4j/eclipselink/blob/2.7/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/helper/ConcurrencyManager.java#L915

and

public void removeThreadNoLongerWaitingToAcquireLockForWriting(Thread thread) https://github.com/eclipse-ee4j/eclipselink/blob/2.7/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/helper/ConcurrencyManager.java#L924

looks OK to me.