bigcy / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Replicator watch for applied seqno hangs if slave replicator goes into error state when using parallel apply #682

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Set up master/slave with parallel apply enabled and multiple channels, e.g.: 

tungsten-installer --master-slave -a  --parallelization-type=disk --channels=5 
<other options>

2. Put a load on the master DBMS. 

3. Issue a call to wait for an applied seqno, e.g. using a JMX call to 
com.continuent.tungsten.replicator.management.OpenReplicatorManager.waitForAppliedSequenceNumber().

4. Cause the replicator to go into an error state before the seqno is reached 
on the slave. 

This sequence appears in regression tests used within Continuent when they hit 
Issue 679.  In theory you should be able to reproduce it using 
'trepctl wait -applied NNN -limit 120', where NNN is a seqno value, but I have 
not confirmed that.  Java-based tests that use JMX confirm it quite easily.  
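The reproduction steps above can be sketched as a shell session. The commands and option values are the ones given in this report; the comments are mine, and the report deliberately leaves '<other options>' and NNN unspecified, so this is a fragment rather than a runnable script:

```shell
# 1. Install a master/slave pair with parallel apply enabled
#    (disk parallelization, 5 channels), plus the usual <other options>.
tungsten-installer --master-slave -a --parallelization-type=disk --channels=5

# 2. Put a load on the master DBMS, then wait for a seqno (NNN) that
#    has not yet been applied on the slave.
trepctl wait -applied NNN -limit 120

# 3. While the wait is pending, cause the slave replicator to go into
#    an error state (the report does not prescribe how; any apply
#    failure that puts the slave into OFFLINE:ERROR should do).
```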

What is the expected output?

The replicator should immediately return to the JMX client when the slave goes 
into OFFLINE:ERROR state.  You should see a stack trace like the following: 

[junit] java.lang.Exception: Wait operation failed: This watch was cancelled
    [junit]     at com.continuent.tungsten.replicator.management.OpenReplicatorManager.waitForAppliedSequenceNumber(OpenReplicatorManager.java:2350)

What do you see instead?

The replicator hangs for the timeout specified in the call to 
waitForAppliedSequenceNumber().  In some cases it seems to hang forever.  

What is the possible cause?

It looks as if waits are not being correctly cleared in the replicator when 
parallel apply is enabled. 
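To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the watch pattern the fix needs to preserve: a waiter blocks until a target seqno is applied, and a cancel (e.g. when the replicator transitions to OFFLINE:ERROR) must wake it immediately instead of letting it run out the timeout. The class and method names here are hypothetical, not the actual Tungsten Replicator internals:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical seqno watch. The real replicator's watch bookkeeping is more
// involved (per-channel progress under parallel apply); this only shows the
// cancellation contract that the bug appears to violate.
class SeqnoWatch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private long applied = -1;
    private boolean cancelled = false;

    // Called as events are applied; must signal waiters on every advance.
    void onApplied(long seqno) {
        lock.lock();
        try {
            applied = Math.max(applied, seqno);
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called when the replicator goes into an error state; must wake all
    // waiters so they return (or throw) immediately rather than time out.
    void cancel() {
        lock.lock();
        try {
            cancelled = true;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Returns true when seqno is reached, false on timeout, and throws
    // promptly if the watch is cancelled while waiting.
    boolean await(long seqno, long timeoutMs) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (applied < seqno) {
                if (cancelled)
                    throw new Exception("Wait operation failed: This watch was cancelled");
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0)
                    return false; // timed out without reaching the seqno
                changed.awaitNanos(remaining);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The hang described above corresponds to a cancel path that never signals (or never reaches) one of the per-channel waiters, so `await` sleeps until the full timeout expires.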

What is the proposed solution?

Find the bug and fix it.  This should have been detected by a unit test, so 
there is probably a test hole there.  It is pretty obvious in regression tests. 

Additional information

This does not occur if you install with parallel apply disabled, e.g., with 
--parallelization-type=none.  This issue is marked critical because it can 
block tests: at a minimum it adds 2 minutes for each failure, and in some 
cases it blocks completely. 


Original issue reported on code.google.com by robert.h...@continuent.com on 25 Aug 2013 at 4:51

GoogleCodeExporter commented 9 years ago
Just to be clear on adding time to tests: our regression tests have a 
120-second timeout when waiting for seqnos to appear on the slave, hence the 
extra 2 minutes.  When parallel apply is disabled, the call fails immediately.  

Original comment by robert.h...@continuent.com on 25 Aug 2013 at 4:53

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 26 Aug 2013 at 1:54

GoogleCodeExporter commented 9 years ago
This may be related to Issue 598, which looks like an off-by-one error in 
handling of watches when parallel replication is enabled. 

Original comment by robert.h...@continuent.com on 26 Aug 2013 at 2:32

GoogleCodeExporter commented 9 years ago
There won't be a 2.1.3.

Original comment by linas.vi...@continuent.com on 17 Sep 2013 at 10:13

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 20 Nov 2013 at 3:56

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 5 May 2014 at 11:09

GoogleCodeExporter commented 9 years ago
We will not use the third version digit for normal releases anymore.  It will 
only be incremented for maintenance releases.

Original comment by linas.vi...@continuent.com on 26 May 2014 at 5:01

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18