LucidDB / luciddb

DEFUNCT: See README
https://github.com/LucidDB/luciddb
Apache License 2.0

[FRG-253] FarragoTestConcurrentTest aborts with a FarragoResource exception #619

dynamobi-build closed this issue 12 years ago

dynamobi-build commented 12 years ago

[reporter="johnk", created="Mon, 22 Jan 2007 14:08:15 -0500 (GMT-05:00)"] As of change 8565:

< Execution aborted
< net.sf.farrago.resource.FarragoResource$_Def0.ex(FarragoResource.java:1656)
< net.sf.farrago.fennel.FennelDbHandle.handleNativeException(FennelDbHandle.java:314)
< net.sf.farrago.fennel.FennelStreamGraph.fetch(FennelStreamGraph.java:188)
< net.sf.farrago.runtime.FennelTupleIter.populateBuffer(FennelTupleIter.java:109)
< net.sf.farrago.runtime.FennelAbstractTupleIter.fetchNext(FennelAbstractTupleIter.java:114)
< net.sf.farrago.dynamic.stmt342.ExecutableStmt$2.fetchNext(Unknown Source)
< org.eigenbase.runtime.TupleIterResultSet.next(TupleIterResultSet.java:99)
< net.sf.farrago.runtime.FarragoTupleIterResultSet.next(FarragoTupleIterResultSet.java:121)
< net.sf.farrago.test.concurrent.FarragoTestConcurrentScriptedCommandGenerator.storeResults(FarragoTestConcurrentScriptedCommandGenerator.java:765)
< net.sf.farrago.test.concurrent.FarragoTestConcurrentScriptedCommandGenerator.access$1000(FarragoTestConcurrentScriptedCommandGenerator.java:61)
< net.sf.farrago.test.concurrent.FarragoTestConcurrentScriptedCommandGenerator$SelectCommand.doExecute(FarragoTestConcurrentScriptedCommandGenerator.java:994)
< net.sf.farrago.test.concurrent.FarragoTestConcurrentCommandGenerator$AbstractCommand.execute(FarragoTestConcurrentCommandGenerator.java:592)
< net.sf.farrago.test.concurrent.FarragoTestConcurrentCommandExecutor.run(FarragoTestConcurrentCommandExecutor.java:184)

awash:eigenmerge/main:/home/jk/eigenmerge/main/farrago/unitsql/concurrent> ll joinNoLockstep*
-rw-r--r-- 1 jk jk 376618 2007-01-22 14:06 joinNoLockstep.log
-r--r--r-- 1 jk jk 169 2006-08-04 09:56 joinNoLockstep.mtsql
-r--r--r-- 1 jk jk 442494 2006-08-04 09:56 joinNoLockstep.ref
awash:eigenmerge/main:/home/jk/eigenmerge/main/farrago/unitsql/concurrent>


Reproduced in 3 of 3 test runs. One test run was on a loaded box, to reduce concurrency. It failed with the same exception, but after a different number of iterations. I have not attempted a regression test on this branch with this box in quite some time.


dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 22 Jan 2007 14:33:37 -0500 (GMT-05:00)"] Could be from my very recent orderly shutdown changes. Do you know what change number you sync'd //open/dev from?

These changes have been running through successfully on both 2-way and 4-way boxes for me, but there must be something slipping through.

The exception says the execution is getting canceled. The only possible cause used to be a cancel() call on the Statement, or a kill_stmt/session procedure call. Now it happens as part of closing the statement, closing the session, or shutting down the DB. Maybe the test is doing one of those too eagerly.

dynamobi-build commented 12 years ago

[author="johnk", created="Mon, 22 Jan 2007 14:37:51 -0500 (GMT-05:00)"] Should be @8565, as reported (somewhat obscurely) in bug description. I did the sync this morning.

I'm also syncing a 4-way and HT (1.5 way?) client to see what I can reproduce.

The .mtsql file doesn't appear, to my untrained eyes, to do an explicit cancel:

awash:eigenmerge/main:/home/jk/eigenmerge/main/farrago/unitsql/concurrent> more joinNoLockstep.mtsql
@nolockstep

@thread t1,t2
        @repeat 100
                select emps.empno, emps.name, emps.gender, depts.*
                from sales.depts, sales.emps where emps.deptno = depts.deptno;
        @end
@end


dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 22 Jan 2007 16:27:20 -0500 (GMT-05:00)"] Obscurely: OK, I'm blind :)

Could you attach the output of FarragoTrace.log?

dynamobi-build commented 12 years ago

[author="johnk", created="Mon, 22 Jan 2007 21:04:59 -0500 (GMT-05:00)"] This test passed a few times on awash (amd_2, gcc3.3.6, and thus STLport4), but generally fails. Trace to follow.

I ran this test dozens of times on Intel_4, gcc4.12 (thus STLport5), and Intel*1.5, gcc3.3.6 and it didn't fail once.

The Black-Box tester in me says that there's a particular intersection of dimensions that causes this to happen.

dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 22 Jan 2007 23:15:38 -0500 (GMT-05:00)"] Amazingly, I was able to hit it once on my laptop just now, but the following five runs have passed.

dynamobi-build commented 12 years ago

[author="jvs", created="Mon, 22 Jan 2007 23:20:31 -0500 (GMT-05:00)"] OK, (and here I am going to use John K's favorite word), if I goose it up from 100 iterations to 1000, I can hit it every time. Should be easy to track down now.

dynamobi-build commented 12 years ago

[author="jvs", created="Tue, 23 Jan 2007 00:13:00 -0500 (GMT-05:00)"] Found it. The streamGraph member in FarragoRuntimeContext is the problem. It doesn't get nullified by FarragoRuntimeContext.closeAllocation, so the hazard is as follows:

1) owning stmt calls closeAllocation(); super.closeAllocation() will unpin the FennelStreamGraph, returning it to the global code cache (where it is now available for reuse)

2) another thread looks it up in the code cache and reuses it (for this test, very likely, since all the threads execute the same SQL over and over!)

3) meanwhile, on the first thread, stmt calls cancel on the now-closed FarragoRuntimeContext (there's some redundancy up there which shows up in the trace log too); this aborts the unsuspecting second thread

Fix is easy (nullify, and check for null in cancel).
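
For readers following along, here is a minimal, single-threaded sketch of the hazard in steps 1-3 and of the "nullify, and check for null in cancel" fix. It is not the Farrago code: StreamGraph, CodeCache, RuntimeContext and the nullifyOnClose flag are hypothetical stand-ins for FennelStreamGraph, the global code cache and FarragoRuntimeContext, and the interleaving of the two threads is replayed sequentially so the outcome is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-ins for the classes named in this thread; not the Farrago API. */
public class StaleHandleSketch {

    /** Stand-in for a pooled FennelStreamGraph. */
    static class StreamGraph {
        private boolean aborted;
        void abort() { aborted = true; }
        void reset() { aborted = false; }
        boolean isAborted() { return aborted; }
    }

    /** Stand-in for the global code cache: an unpinned graph becomes reusable. */
    static class CodeCache {
        private final Deque<StreamGraph> idle = new ArrayDeque<>();

        StreamGraph pin() {
            StreamGraph g = idle.isEmpty() ? new StreamGraph() : idle.pop();
            g.reset(); // a freshly pinned graph starts in a non-aborted state
            return g;
        }

        void unpin(StreamGraph g) { idle.push(g); }
    }

    /** Stand-in for FarragoRuntimeContext; the flag switches the fix on or off. */
    static class RuntimeContext {
        private final CodeCache cache;
        private final boolean nullifyOnClose; // assumption: models the fix
        private StreamGraph streamGraph;

        RuntimeContext(CodeCache cache, boolean nullifyOnClose) {
            this.cache = cache;
            this.nullifyOnClose = nullifyOnClose;
            this.streamGraph = cache.pin();
        }

        StreamGraph graph() { return streamGraph; }

        void closeAllocation() {
            if (streamGraph == null) {
                return; // already closed
            }
            cache.unpin(streamGraph);
            if (nullifyOnClose) {
                streamGraph = null; // the fix: drop the stale reference
            }
        }

        void cancel() {
            if (streamGraph != null) { // the fix: nothing to cancel once closed
                streamGraph.abort();
            }
        }
    }

    /** Replays steps 1-3 in order; returns true if the second owner was aborted. */
    static boolean secondOwnerAborted(boolean nullifyOnClose) {
        CodeCache cache = new CodeCache();
        RuntimeContext first = new RuntimeContext(cache, nullifyOnClose);
        first.closeAllocation();                                            // step 1
        RuntimeContext second = new RuntimeContext(cache, nullifyOnClose);  // step 2
        first.cancel();                                                     // step 3
        return second.graph().isAborted();
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + secondOwnerAborted(false));
        System.out.println("with fix:    " + secondOwnerAborted(true));
    }
}
```

In this sketch secondOwnerAborted(false) returns true, because the first context still holds the graph that the second context has since pinned, while secondOwnerAborted(true) returns false. The real fix would also need whatever synchronization FarragoRuntimeContext already uses, which this sketch does not model.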

dynamobi-build commented 12 years ago

[author="jvs", created="Tue, 23 Jan 2007 00:30:50 -0500 (GMT-05:00)"] Fixed on //open/dev in eigenchange 8569.

dynamobi-build commented 12 years ago

[author="johnk", created="Tue, 23 Jan 2007 08:40:12 -0500 (GMT-05:00)"] Ran 11 of 11 times successfully on awash.
Thanks!