feristhia / h2database

Automatically exported from code.google.com/p/h2database

h2 hangs for a long time then (sometimes) recovers #567

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
I'm not sure how to provide test data or code for any of this. We're using H2
on our server for a large-ish database (anywhere from 6 GB to 180 GB), and we're
using in-memory tables which are stored on disk as well. The Java H2 server
process is set to run with -Xmx values > 200 GB (and our server has that much
free RAM available before we run the H2 server). The H2 server runs in a
separate JVM process; all this process does is launch the H2 server, and there
is no other activity in it. So this process, which has > 200 GB available,
runs the H2 server and nothing more.

One of the applications is a Tomcat application; there is no connection pooling
being used in Tomcat. We have a few other Windows apps which tap into the
database, nothing complicated. There are no connection pools used anywhere
(yet).

Every once in a while the queries in our applications hang. They sit there for
10 to 30 minutes at times, then all of a sudden it clears up and they can all
run again. Sometimes it hangs for > 2 hours and the only solution is to kill
the database and restart it. This is problematic because we are not always able
to recover the database; sometimes when loading the database from disk we get
H2 server errors (corruption in the database), and then the only option is to
rebuild the database from scratch (a 2-day process).

I don't know what is causing the issue, so I can't yet provide steps to
reproduce; I only know it sometimes happens twice in one day, causing our
applications to go down.

I have added the following to the initial JDBC URL:
PAGE_SIZE=16384;MAX_QUERY_TIMEOUT=300000

These are added via the H2 console in Firefox: we start the H2 server, which
launches the H2 console in Firefox; I enter the credentials and database URL,
then append those two parameters.
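
For reference, here is a minimal sketch of opening a connection with those two
settings appended to the JDBC URL (the host, port, database path, and
credentials are placeholders, not our real values):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class H2ConnectSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder server address and database path; requires the H2
            // jar on the classpath. The two settings mentioned above are
            // appended as semicolon-separated options.
            String url = "jdbc:h2:tcp://localhost:9092/~/exampledb"
                    + ";PAGE_SIZE=16384;MAX_QUERY_TIMEOUT=300000";
            try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
                System.out.println("Connected to " + conn.getMetaData().getURL());
            }
        }
    }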

To me it feels like one of two things: either resource starvation somewhere, or
the lack of multi-threading (maybe the queries are being queued up and getting
stuck somewhere). I have also tried the experimental multi-threaded option, but
that caused query errors within two minutes of turning it on. I forget the
exact error message but can provide it if required.

I am definitely closing all SQL connections when they are finished; the
database framework we use has built-in detection that warns us when a
connection is not closed (if a connection is cleaned up by the garbage
collector without having been explicitly closed by the app, it logs a message;
we have received no such messages).
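
For illustration, the detection works roughly like the following sketch
(hypothetical class names, not the actual framework code): each connection is
wrapped, and if the garbage collector reclaims the wrapper before close() was
called, a warning is logged.

    import java.sql.Connection;
    import java.sql.SQLException;

    // Hypothetical sketch of finalizer-based leak detection, not the actual
    // framework code.
    public class TrackedConnection implements AutoCloseable {
        private final Connection delegate;
        private volatile boolean closed;

        public TrackedConnection(Connection delegate) {
            this.delegate = delegate;
        }

        public Connection raw() {
            return delegate;
        }

        @Override
        public void close() throws SQLException {
            closed = true;
            delegate.close();
        }

        @Override
        protected void finalize() throws Throwable {
            try {
                // Runs when the wrapper is garbage-collected; if close() was
                // never called, the application leaked this connection.
                if (!closed) {
                    System.err.println("WARN: connection was garbage-collected"
                            + " without being explicitly closed");
                }
            } finally {
                super.finalize();
            }
        }
    }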

Is there some variable or setting I could try that would help out here? We
would like to stay with H2 due to its speed when tables are in RAM, and a few
other features which are really nice; but right now our applications are
unstable and we have to track this down somehow.

Thanks in advance for any comments or suggestions.

Original issue reported on code.google.com by localde...@gmail.com on 28 Jun 2014 at 1:39

GoogleCodeExporter commented 9 years ago
We face similar issues with version 1.4.179. Sometimes a thread just keeps
hanging while trying to obtain a lock, and it can take an hour or more. This
behaviour did not occur with version 1.3.175. Unfortunately, I have not been
able to determine any further details so far.

The pattern seen in thread dumps is the following:
"TP-Processor40" Id=14830 TIMED_WAITING on org.h2.engine.Database@5be7d382
    at java.lang.Object.wait(Native Method)
    -  waiting on org.h2.engine.Database@5be7d382
    at org.h2.table.RegularTable.doLock(RegularTable.java:549)
    at org.h2.table.RegularTable.lock(RegularTable.java:463)
    at org.h2.engine.Database.lockMeta(Database.java:880)
    at org.h2.engine.Database.addDatabaseObject(Database.java:994)
    at org.h2.command.ddl.CreateSchema.update(CreateSchema.java:52)
    at org.h2.command.CommandContainer.update(CommandContainer.java:78)
    at org.h2.command.Command.executeUpdate(Command.java:253)
    -  locked org.h2.engine.Session@b32a1eb
    at org.h2.jdbc.JdbcStatement.executeInternal(JdbcStatement.java:185)
    -  locked org.h2.engine.Session@b32a1eb
    at org.h2.jdbc.JdbcStatement.execute(JdbcStatement.java:159)
    at org.h2.tools.RunScript.execute(RunScript.java:170)
...

Original comment by radek.kr...@gmail.com on 3 Jul 2014 at 12:59

GoogleCodeExporter commented 9 years ago
Could you please provide full thread dumps (that is, all threads), plus a 
reproducible test case if possible?

Original comment by thomas.t...@gmail.com on 4 Jul 2014 at 7:31

GoogleCodeExporter commented 9 years ago
Patch committed for this in revision 5780.

Will be available in Version 1.4.180

Original comment by noelgrandin on 8 Jul 2014 at 8:21

GoogleCodeExporter commented 9 years ago
I had a look at the stack traces. It looks like the problem is that you 
synchronize on the class in a central place, so that only one transaction can 
run at a time:

    at com.polarion.platform.sql.internal.def.BaselineLock.createIfNeeded(BaselineLock.java:45)
    -  locked java.lang.Class@7fe07fe8
    at com.polarion.platform.sql.internal.def.DBDef.createBaseline(DBDef.java:649)

The thread that owns the lock will then try to lock the table, and eventually
time out. All the other threads (including the thread that holds the table
lock) can't continue because they are waiting for the lock on this class.

So I think this is a "user error", and the patch from Noel will not actually
solve your problem. To solve it, you need to use a different mechanism (not
locking on the class).
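
To illustrate the pattern being described (hypothetical names, not the actual
application code): every writer enters a single JVM-wide synchronized block
before touching the database, so one thread blocking on an H2 table lock
stalls all other transactions.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public final class GlobalWriteLock {
        // Anti-pattern sketch: one class-level monitor serializes all writes.
        public static void write(Connection conn, String sql) throws SQLException {
            synchronized (GlobalWriteLock.class) {
                // If this statement blocks waiting for an H2 table lock (for
                // up to LOCK_TIMEOUT), every other writer queues up here on
                // the class monitor rather than inside the database.
                try (Statement stmt = conn.createStatement()) {
                    stmt.execute(sql);
                }
            }
        }
    }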

Original comment by dora.pet...@gmail.com on 16 Jul 2014 at 12:30

GoogleCodeExporter commented 9 years ago
Thanks for taking a look at the stack traces. I agree this implementation is
not ideal, but we had to implement it this way. With multi-threaded writes to
the DB, the database eventually became corrupted, so we were forced to ensure
that only one thread can write to the DB at a time.

This implementation has worked fine in production for more than a year, so I am
sure that the recent problems are new in the 1.4.x versions of H2. Also note
that after the timeout of the hanging transaction, the application resumes and
works fine again.

Let's see if this fix helps to deal with the issue. My plan is to remove this
synchronization as soon as a 1.4 version with multi-threaded access and MVCC is
stable enough to support my needs, i.e. the DB does not get corrupted on
concurrent writes.

Original comment by radek.kr...@gmail.com on 23 Jul 2014 at 1:57

GoogleCodeExporter commented 9 years ago
Radek, what does your full database URL look like?

Original comment by noelgrandin on 23 Jul 2014 at 2:01

GoogleCodeExporter commented 9 years ago
We came across this bug last week; we are also experiencing hangs when using
H2 in our tests with versions 1.4.179 and 1.4.180. Our workaround is to use a
different DB name for each test (see the sketch below). That seems to fix the
hangs; not sure why.
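
A sketch of that workaround (hypothetical helper, not our actual test code):
generate a unique database name per test so tests never contend for the same
database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.UUID;

    public class PerTestDatabase {
        // Each call opens a connection to a uniquely named in-memory
        // database, so no two tests ever share the same DB instance.
        // DB_CLOSE_DELAY=-1 keeps the in-memory DB alive between connections.
        static Connection openFreshDb() throws Exception {
            String name = "test_" + UUID.randomUUID().toString().replace('-', '_');
            return DriverManager.getConnection("jdbc:h2:mem:" + name + ";DB_CLOSE_DELAY=-1");
        }
    }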

Original comment by ismaelj on 23 Jul 2014 at 2:02

GoogleCodeExporter commented 9 years ago
On the production server it would be
jdbc:h2:tcp://10.10.10.10:9090//opt/polarion/data/multi-instance/cns/workspace/polarion-data/database/polarion;MV_STORE=FALSE;MULTI_THREADED=TRUE;MODE=PostgreSQL;LOCK_TIMEOUT=1800000;MAX_COMPACT_TIME=100000

Original comment by radek.kr...@gmail.com on 23 Jul 2014 at 6:50

GoogleCodeExporter commented 9 years ago
Instead of creating your own synchronisation point, you would have better luck 
if you just stopped using the MULTI_THREADED=TRUE option (which is still 
experimental) and you would have better performance as well.
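
Concretely, that would mean dropping MULTI_THREADED=TRUE from the URL above,
leaving something like:

    jdbc:h2:tcp://10.10.10.10:9090//opt/polarion/data/multi-instance/cns/workspace/polarion-data/database/polarion;MV_STORE=FALSE;MODE=PostgreSQL;LOCK_TIMEOUT=1800000;MAX_COMPACT_TIME=100000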

Original comment by noelgrandin on 23 Jul 2014 at 7:18

GoogleCodeExporter commented 9 years ago
Well, we tried, but performance was far from acceptable, so we had to take this
approach.

I'm looking forward to seeing the MVStore stable enough for production use with
multi-threaded access and MVCC, but it will still take some time to stabilize.

Original comment by radek.kr...@gmail.com on 24 Jul 2014 at 9:32