eclipse-ee4j / genericmessagingra

BSD 3-Clause "New" or "Revised" License
2 stars 5 forks source link

Race condition in InboundJmsResourcePool.java causes inbound message processing to halt #57

Open glassfishrobot opened 10 years ago

glassfishrobot commented 10 years ago

A race condition in com.sun.genericra.inbound.async.InboundJmsResourcePool.java can cause all inbound message processing on the queue associated with a particular pool instance to halt; processing on this queue will not resume until the affected application server is restarted.

MDBs configured with a MaxPoolSize of 1 are particularly vulnerable, as are queues receiving bursty traffic where messages sometimes arrive in considerable numbers faster than the MDB can process them. The onMessage() execution time also plays a role. The bug can manifest itself randomly within minutes, hours, or many days after startup. In clustered configurations, each server is separately and independently vulnerable to the race.

The race condition is triggered when the MQS driver, seeing a new message in the inbound queue, calls getServerSession() in one thread at the same time when a worker thread tries to release a ServerSession instance (= InboundJmsResource instance) back to the pool. In the problematic scenario, the resource is successfully marked as free, and connectionsInUse decremented, by the second thread; but the first thread, having found the pool exhausted a moment earlier, creates a PauseObject and puts itself on the pool's waitQ after the second thread has found the waitQ empty, and thus the first thread is never notified of the freed resource. This outcome can be verified in a heap dump; the only indication in the server log is that a "Notifying the thread" at level FINE is not logged.

I am going to attach a proposed fix. This also subsumes a fix for #56 (which had arisen at an earlier point of the same incident investigation).

Environment

Seen both with GlassFish 2.1.1-p17 through -p21, GenericJMSRA as bundled with this (= 2.0.1 from mid 2009), and with GlassFish 3.1.2.6, GenericJMSRA 2.1b, on Linux (SLES10SP2 and SLES11SP1) x86_64, and using the IBM MQS driver 7.0.1.6. By inspection, the bug has existed since inception of the code.

Affected Versions

[0.9, 1.0a, 1.0, v1.5, 1.5, v1.6, 1.6, 2.1a, 2.1b, current, 2.2a]

glassfishrobot commented 6 years ago
glassfishrobot commented 10 years ago

@glassfishrobot Commented Reported by gerhard_niklasch

glassfishrobot commented 10 years ago

@glassfishrobot Commented gerhard_niklasch said: This JIRA doesn't seem to allow me to upload attachments...

Unified context diffs against latest rev=236:

— ../../InboundJmsResourcePool.java,rev=236 2013-12-03 10:37:32.000000000 +0100 +++ com/sun/genericra/inbound/async/InboundJmsResourcePool.java 2014-01-14 11:27:28.000000000 +0100 @@ -1,5 +1,5 @@ /**

import javax.jms.; -import com.sun.genericra.GenericJMSRA; import javax.resource.spi.; import javax.resource.spi.endpoint.*; import javax.resource.ResourceException; @@ -34,7 +33,7 @@

/**

{ throw ExceptionUtils.newResourceException(e); }

}

while (result == null) { validate();

while (it.hasNext()) { @@ -283,25 +334,27 @@

if (resource.isFree())

{ connectionsInUse++; - + return resource.markAsBusy(); }

}

if (resources.size() < this.maxSize)

{ - InboundJmsResource res = create(); - resources.add(res); + InboundJmsResource resource = create(); + resources.add(resource); connectionsInUse++; - return res.markAsBusy(); + return resource.markAsBusy(); }

-

public synchronized void put(InboundJmsResource resource) { resource.markAsFree(); connectionsInUse--;

if (stopped) { if (connectionsInUse <= 0)

{ @@ -312,35 +365,55 @@ }

}

{ + obj = (PauseObject) waitQ.removeFirst(); + }

{ + // released resource is reserved for oldest waiter, about to + // be awakened + obj.resume(); + }

else

{ + // no waiters: released resource becomes immediately available + connectionsAvail++; + }

if (dmdCon != null)

{ this.dmdCon.close(); }

try

{ wait(this.consumer.getSpec().getEndpointReleaseTimeout() * 1000); @@ -349,20 +422,9 @@ }

}

while (it.hasNext())

{ @@ -372,78 +434,112 @@ obj.destroy(); }

catch (Exception e)

{ // This is just to make sure that if one resource fails to destroy - // we still call destroy on others. + // we still call destroy on others. _logger.log(Level.SEVERE, "Cannot destroy resource " + obj.toString()); }

}

{ + _logger.log(Level.FINE, "Destroying the ServerSession resource pool..."); + this.destroyed = true; + stop(); + releaseAllWaitingThreads(); + _logger.log(Level.FINE, "ServerSession resource pool destroyed."); + }

/**

if (elapsedWaitTime > maxWaitTime) {

synchronized void resume()

{ - _logger.log(Level.FINE, "Notifying the thread"); + _logger.log(Level.FINE, "Awakening and notifying the thread"); + awakened = true; notify(); }

}

— ../../ActivationSpec.java,rev=236 2013-12-12 13:16:49.000000000 +0100 +++ com/sun/genericra/inbound/ActivationSpec.java 2014-01-14 11:25:34.000000000 +0100 @@ -1,5 +1,5 @@ /**

glassfishrobot commented 10 years ago

@glassfishrobot Commented gerhard_niklasch said: Additional explanations of what's being changed and why:

com.sun.genericra.inbound.async.InboundJmsResourcePool com.sun.genericra.inbound.async.InboundJmsResourcePool$PauseObject

Unfortunately, maintaining the required invariants requires fixing a number of additional glitches, including two other race conditions.

The remainder here is cosmetic in nature:

com.sun.genericra.inbound.ActivationSpec

glassfishrobot commented 10 years ago

@glassfishrobot Commented gerhard_niklasch said: I see that the JIRA web interface has rather mangled the above context diff. Contact me offline via the dev mailing list for the literal version. I'll also be happy to supply a diff -ub (i.e. ignoring the whitespace-only changes) for ease of reviewing.

glassfishrobot commented 10 years ago

@glassfishrobot Commented gerhard_niklasch said: See also #58 for one further issue in this area, filed separately since its fix is logically independent from the present issue.

glassfishrobot commented 10 years ago

@glassfishrobot Commented gerhard_niklasch said: The putback for Bug#17973448 into trunk revision 237 delivers a fix for the primary race (somewhat different from my proposal, succesfully tested on both GF 2.1.1 and GF 3.1.2 with MQS7 as the JMS provider), and it also cleans up the formatting and improves the log messages.

It also changes the default MaxWaitTime from 3s to 300s.

Rev 237 still has a potential for a new caller of getServerSession() to race past older callers sitting in the wait queue, and starve the latter. However, this situation cannot arise with MQS: The MQS7 driver will never call getServerSession() to the same InboundJmsResourcePool from more than one dispatch thread simultaneously.

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA GENERICJMSRA-57