danielcheng007 / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator
0 stars 0 forks source link

Replicator has problems finding filtered events in log, leading to slave restart failures #397

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Set up master slave replication. 
2. Add a filter that looks like the following to drop 3/4 of the transactions 
generated on the master. 

/** Filter filter34.js */
function filter(event)
{
    // Drop 3/4 of events. 
    seqno = event.getSeqno();
    logger.info("Processing event: seqno=" + seqno);
    if (seqno % 4 > 0)
    {
         logger.info("Dropping event: seqno=" + seqno);
         return null;
    }
    else
    {
         return event;
    }
}

Add the filter as follows to the master pipeline of the master 
static-svc.properties.  

...
replicator.stage.q-to-thl.filters=filter34
...
replicator.filter.filter34=com.continuent.tungsten.replicator.filter.JavaScriptF
ilter
replicator.filter.filter34.script=${replicator.home.dir}/samples/extensions/java
script/filter34.js

3. Restart the replicator and issue a few of 'trepctl heartbeat' commands on 
the master.  

4. Use thl list to list events and find one of the filtered events.  Here's an 
example: 

thl list
...
SEQ# = 25 / FRAG# = 0 (last frag)
- TIME = 2012-11-03 10:41:45.919
- EPOCH# = 0
- EVENTID = mysql-bin.000159:0000000000006555;0
- SOURCEID = logos1
- METADATA = []
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSFilteredEvent
## Filtered events ##
From Seqno# 25 / Fragno# 0
To Seqno# 27 / Fragno# 0

5. Try to find any seqno other than the first in the filtered event, e.g.: 

thl list -seqno 26

6. Now take the slave offline.  Use SQL to alter the trep_commit_seqno table on 
the slave to point to a seqno inside the filtered event, e.g.: 

mysql> update trep_commit_seqno set seqno=26;

7. Bring the slave back online.  

What is the expected output?

a.) In the case of the thl list in step 6, you would expect to find the seqno.  

b.) In the case of the thl list, you would likewise expect to restart properly 
even if the transaction is slightly off.  

What do you see instead?

a.) The thl utility cannot find the filtered event, e.g.:  

thl list -seqno 26
...
2012-11-03 10:42:34,852 [ - main] ERROR thl.THLManagerCtrl Unable to find 
sequence number: 26

b.) The replicator fails to go online.  The status looks like the following: 

trepctl status
...
pendingError           : Read failed on transaction log: seqno=27 taskId=0
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: Log seek failure: expected seqno=27 found seqno=28
...

What is the possible cause?

Log seek logic does not work for filtered events.  We do not return the 
filtered event and instead may return the next full event. 

What is the proposed solution?

Alter log seek logic to return the filtered event.  Clients must be prepared 
for the filtered event to contain their sequence number rather than have it at 
either end. 

Additional information

This problem seems obscure but it's not.  If you need to reset 
trep_commit_seqno in deployments with a lot of filtered events, you can run 
into serious problems setting the restart point correctly.  Also, under some 
circumstances it appears slave replicators will not restart if the logs have 
the right combination of filtered and empty events.  This came up in yet 
another deployment and appears to be due to a separate bug. 

Use labels and text to provide additional information.

Original issue reported on code.google.com by robert.h...@continuent.com on 3 Nov 2012 at 6:01

GoogleCodeExporter commented 9 years ago
This problem is fixed and tested.  The seek logic was changed as described 
above.  

Original comment by robert.h...@continuent.com on 20 Dec 2012 at 6:07