[FRG-263] code cache concurrency problem with stale and non-cachable SQL statements

dynamobi-build commented 12 years ago

[reporter="jvs", created="Wed, 11 Apr 2007 16:06:31 -0500 (GMT-05:00)"] Zelaine Fong wrote:

John,

One of the other failures I'm looking at in last night's nightly run is
a failure in one of the large concurrency tests.
luciddb/test/sql/concurrency/large-delete2 fails with an assert while
trying to discard a cache entry that's pinned by another thread.

Working backwards from the stack, I think what's happening is the
following:

1) The test prior to large-delete2, large-delete1, creates a table t100.
One of the statements it executes in the test is "delete from t100
where kseq = 1". At the end of the test, t100 is dropped.

2) large-delete2 then creates t100 again. It then also issues a "delete
from t100 where kseq = 1" statement. That statement is already in the
Farrago statement cache as a result of large-delete1, but since the t100
it's referencing is the original t100 that's been dropped,
FarragoDatabase.isStale() returns true, and
FarragoDatabase.prepareStmtImpl() then tries to discard the out-of-date
cache entry by calling codeCache.discard().

The discard() asserts because the entry's pincount != 0. I think that's
because there are other threads in large-delete2 that are also trying to
execute "delete from t100 where kseq = 1". Therefore, they have the
entry pinned before it's been detected that the entry is out-of-date.
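
To make the suspected interleaving concrete, here is a minimal single-threaded replay of it. This is only a sketch under stated assumptions: `PinCountCacheSketch`, its `Entry` class, and the rule that `discard()` expects the caller to hold the only pin are hypothetical stand-ins, not the actual FarragoObjectCache code, and an `IllegalStateException` is thrown instead of a Java `assert` so the behavior does not depend on running with `-ea`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a pinning cache; NOT the real FarragoObjectCache.
class PinCountCacheSketch {
    static final class Entry {
        final String key;
        int pinCount;

        Entry(String key) {
            this.key = key;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Look up (or create) the entry for a key and pin it for the caller.
    synchronized Entry pin(String key) {
        Entry e = entries.computeIfAbsent(key, Entry::new);
        e.pinCount++;
        return e;
    }

    synchronized void unpin(Entry e) {
        e.pinCount--;
    }

    // Assumption: discard releases the caller's pin and expects no other pins.
    synchronized void discard(Entry e) {
        e.pinCount--;
        if (e.pinCount != 0) {
            throw new IllegalStateException("entry still pinned: " + e.pinCount);
        }
        entries.remove(e.key);
    }

    // Replays the interleaving: two sessions pin the same stale entry,
    // then one of them tries to discard it.
    static boolean discardFailsWhilePinned() {
        PinCountCacheSketch cache = new PinCountCacheSketch();
        String sql = "delete from t100 where kseq = 1";
        Entry a = cache.pin(sql); // session A prepares the statement
        Entry b = cache.pin(sql); // session B prepares the same SQL
        try {
            cache.discard(a); // A notices staleness only after the lookup returned
            return false;
        } catch (IllegalStateException ex) {
            return true; // B's pin is still outstanding
        }
    }

    public static void main(String[] args) {
        System.out.println("discard failed while pinned: " + discardFailsWhilePinned());
    }
}
```

The point of the replay is that the staleness check happens outside the cache's own synchronization, so nothing prevents a second session from pinning the entry between the lookup and the discard.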

I don't think any of the changes I made for versioning changed this, and
it's probably just a race condition that we've been unlucky never to
have hit. In the nightly run, it failed only on the Red Hat run. The
Ubuntu one passed. And I haven't been able to reproduce it on tikki01-red.

Does this sound right?

-- Zelaine

dynamobi-build commented 12 years ago

[author="jvs", created="Wed, 11 Apr 2007 16:07:36 -0500 (GMT-05:00)"] Your analysis looks correct to me. The bug must have been present since I first added the staleness check.

The code John Pham added for checking stmt.mayCacheImplementation looks similarly bad, since two statements may prepare the same uncacheable SQL at about the same time, meaning they will end up sharing it when they shouldn't.

The correct fix for both problems is probably to enhance FarragoObjectCache in such a way that both tests can be applied from within the cache lookup (rather than after the lookup returns), with the correct actions somehow being taken in response.
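
One possible shape for such an enhancement is sketched below. It is an illustration only, not the change that was eventually checked in: `ValidatingCacheSketch`, its `pin()` signature, and the `isStale` / `isCacheable` predicates are all hypothetical names. The idea is that both tests run while the cache lock is held, so a stale entry is detached from the map (existing pinners keep their reference until they unpin) and an uncacheable result is handed to the caller without ever being published for sharing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; the names here do not come from FarragoObjectCache.
class ValidatingCacheSketch<K, V> {
    static final class Entry<K, V> {
        final K key;
        final V value;
        int pinCount;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<K, Entry<K, V>> entries = new HashMap<>();

    // Staleness and cacheability are both decided inside the lookup.
    synchronized Entry<K, V> pin(
        K key,
        Function<K, V> factory,
        Predicate<V> isStale,
        Predicate<V> isCacheable)
    {
        Entry<K, V> e = entries.get(key);
        if (e != null && isStale.test(e.value)) {
            // Detach the stale entry; sessions that pinned it keep using it.
            entries.remove(key);
            e = null;
        }
        if (e == null) {
            e = new Entry<>(key, factory.apply(key));
            if (isCacheable.test(e.value)) {
                entries.put(key, e); // only shareable results are published
            }
        }
        e.pinCount++;
        return e;
    }

    synchronized void unpin(Entry<K, V> e) {
        e.pinCount--;
    }

    synchronized int size() {
        return entries.size();
    }

    static boolean staleEntryIsReplaced() {
        ValidatingCacheSketch<String, String> cache = new ValidatingCacheSketch<>();
        String sql = "delete from t100 where kseq = 1";
        Entry<String, String> first =
            cache.pin(sql, k -> "plan for old t100", v -> false, v -> true);
        // t100 is dropped and recreated, so the cached plan is now stale.
        Entry<String, String> second =
            cache.pin(sql, k -> "plan for new t100", v -> v.contains("old"), v -> true);
        return first != second
            && first.pinCount == 1
            && second.value.equals("plan for new t100");
    }

    static boolean uncacheableIsNotShared() {
        ValidatingCacheSketch<String, String> cache = new ValidatingCacheSketch<>();
        String sql = "select * from some_uncacheable_source";
        Entry<String, String> a = cache.pin(sql, k -> "one-shot plan", v -> false, v -> false);
        Entry<String, String> b = cache.pin(sql, k -> "one-shot plan", v -> false, v -> false);
        return a != b && cache.size() == 0;
    }

    public static void main(String[] args) {
        System.out.println("stale entry replaced: " + staleEntryIsReplaced());
        System.out.println("uncacheable not shared: " + uncacheableIsNotShared());
    }
}
```

For brevity the sketch calls the factory while holding the cache lock, which would serialize statement preparation; a real implementation would presumably insert a placeholder entry and build the value outside the lock.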

Logging FRG-263 for now.

JVS

dynamobi-build commented 12 years ago

[author="elin", created="Thu, 17 May 2007 16:15:56 -0500 (GMT-05:00)"] Saw this again in nightly tests for 8103.9256. large-delete2 expects to see a failure to acquire lock on table LOCALDB.S.T100, but instead we get the assertion error below:

May 17, 2007 4:34:02 AM FarragoJdbcUtil newSqlException(ex)
FINER: THROW
java.lang.AssertionError: 0
at net.sf.farrago.util.FarragoObjectCache.discardEntry(FarragoObjectCache.java:411)
at net.sf.farrago.util.FarragoObjectCache.discard(FarragoObjectCache.java:380)
at net.sf.farrago.db.FarragoDatabase.prepareStmtImpl(FarragoDatabase.java:896)
at net.sf.farrago.db.FarragoDatabase.prepareStmt(FarragoDatabase.java:746)
at net.sf.farrago.db.FarragoDbSession.prepareImpl(FarragoDbSession.java:1023)
at net.sf.farrago.db.FarragoDbSession.prepare(FarragoDbSession.java:951)
at net.sf.farrago.db.FarragoDbStmtContext.prepare(FarragoDbStmtContext.java:116)
at net.sf.farrago.jdbc.engine.FarragoJdbcEngineConnection.prepareStatement(FarragoJdbcEngineConnection.java:327)
at net.sf.farrago.test.concurrent.FarragoTestConcurrentScriptedCommandGenerator$SqlCommand.doExecute(FarragoTestConcurrentScriptedCommandGenerator.java:1066)
at net.sf.farrago.test.concurrent.FarragoTestConcurrentCommandGenerator$AbstractCommand.execute(FarragoTestConcurrentCommandGenerator.java:592)
at net.sf.farrago.test.concurrent.FarragoTestConcurrentCommandExecutor.run(FarragoTestConcurrentCommandExecutor.java:184)
May 17, 2007 4:34:02 AM net.sf.farrago.db.FarragoDbStmtContext cancel
INFO: cancel

dynamobi-build commented 12 years ago

[author="jvs", created="Tue, 29 May 2007 14:20:48 -0500 (GMT-05:00)"] For an enhancement which is semi-related to this bug, see

http://sourceforge.net/mailarchive/forum.php?thread_name=03a701c7a22d%244e9564b0%246ffea8c0%40branston&forum_name=farrago-developers

dynamobi-build commented 12 years ago

[author="jhyde", created="Thu, 31 May 2007 18:59:58 -0500 (GMT-05:00)"] That semi-related issue is definitely related. I ran into the same stack trace that Elizabeth did.

dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 11 Jun 2007 10:03:14 -0500 (GMT-05:00)"] Fix checked in on //open/dev in eigenchange 9448. Keeping bug open pending review and further cleanup.

dynamobi-build commented 12 years ago

[author="jvs", created="Sun, 17 Jun 2007 01:13:01 -0500 (GMT-05:00)"] Further cleanup in eigenchange 9465.

[FRG-263] code cache concurrency problem with stale and non-cachable SQL statements #609