danielcheng007 / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

replicator in --direct mode does not resume replication after going offline with an empty THL #136

Open · GoogleCodeExporter opened this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. install a replicator in --direct mode
2. Before the master can send any event through, put the replicator offline
3. create a table in the master
4. put the replicator back online

What is the expected output?

The replicator catches the event, and the event is reproduced in the slave

What do you see instead?

An empty THL, no relay logs imported, no events reproduced in the slave
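
For concreteness, here is the failing sequence as a shell session (a sketch: the $MASTER variable and the Castor service name follow the transcript in a later comment, and the table name is illustrative):

$ trepctl offline                                     # step 2: offline before any event arrives
$ mysql -h $MASTER -e 'create table test.t1(i int)'   # step 3: change on the master
$ trepctl online                                      # step 4
$ thl -service Castor list                            # THL is still empty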

Original issue reported on code.google.com by g.maxia on 29 Jun 2011 at 9:31

GoogleCodeExporter commented 9 years ago
Clarification: Note that, if the replicator has received at least one event 
before going offline, it picks up replication just fine.

1. install a replicator in --direct mode
2. create a table in the master
3. check that the event was received
4. put the replicator offline
5. create another table in the master
6. put the replicator back online
7. check the events: both tables are in the slave

Only if you skip steps 2 and 3 does the online operation fail.
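
A sketch of this working sequence, using the same commands as above (test.t0 and test.t1 are illustrative table names):

$ mysql -h $MASTER -e 'create table test.t0(i int)'   # step 2: seed one event
$ thl -service Castor list                            # step 3: the event is in the THL
$ trepctl offline                                     # step 4
$ mysql -h $MASTER -e 'create table test.t1(i int)'   # step 5
$ trepctl online                                      # step 6
$ thl -service Castor list                            # step 7: both events are present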

Original comment by g.maxia on 29 Jun 2011 at 9:36

GoogleCodeExporter commented 9 years ago
Sorry for the incorrect verification. 
It is not fixed yet.

$ trepctl offline
$ mysql -h $MASTER -e 'create table test.t1(i int)'
$ trepctl online
$ thl -service Castor list
2011-06-30 11:35:57,798 INFO  thl.log.DiskLog Using directory '/home/tungsten/newinst/thl/Castor/' for replicator logs
2011-06-30 11:35:57,799 INFO  thl.log.DiskLog Checksums enabled for log records: true
2011-06-30 11:35:57,799 INFO  thl.log.DiskLog Using read-only log connection
2011-06-30 11:35:57,803 INFO  thl.log.DiskLog Loaded event serializer class: com.continuent.tungsten.replicator.thl.serializer.ProtobufSerializer
2011-06-30 11:35:57,804 INFO  thl.log.LogIndex Building file index on log directory: /home/tungsten/newinst/thl/Castor
2011-06-30 11:35:57,812 INFO  thl.log.LogIndex Constructed index; total log files added=1
2011-06-30 11:35:57,812 INFO  thl.log.DiskLog Validating last log file: /home/tungsten/newinst/thl/Castor/thl.data.0000000001
2011-06-30 11:35:57,812 INFO  thl.log.DiskLog Setting up log flush policy: fsyncIntervalMillis=0 fsyncOnFlush=false
2011-06-30 11:35:57,813 INFO  thl.log.DiskLog Idle log connection timeout: 28800000ms
2011-06-30 11:35:57,813 INFO  thl.log.DiskLog Log preparation is complete
2011-06-30 11:35:57,815 ERROR replicator.thl.THLManagerCtrl Unable to find sequence number: -1

Original comment by g.maxia on 30 Jun 2011 at 9:44

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r269.

Original comment by berkeley...@gmail.com on 1 Jul 2011 at 6:33

GoogleCodeExporter commented 9 years ago
Sorry. Not fixed yet.

Case 1. 
* replicator offline with empty THL
* no changes from master in this period
* replicator goes online
* new changes from master reach the slave. 
* case OK.

Case 2. 
* replicator offline with empty THL
* master produces changes while replicator is offline
* replicator goes online
* the THL is still empty (changes made while the replicator was offline did not 
get in)
* case FAIL

Tested using build 152 

DATE: Fri Jul  1 07:19:11 UTC 2011
RELEASE: tungsten-replicator-2.0.4-152
USER ACCOUNT: hudson
BUILD_NUMBER: 152 
BUILD_ID: 152 
JOB_NAME: Build Replicator Branch-2.0 Google
BUILD_TAG: hudson-Build Replicator Branch-2.0 Google-152
HUDSON_URL: http://cc.aws.continuent.com/
SVN_REVISION: 268 
HOST: ip-10-251-90-63
SVN URLs:
  https://tungsten-replicator.googlecode.com/svn/trunk/commons
  https://tungsten-replicator.googlecode.com/svn/trunk/fsm
  https://tungsten-replicator.googlecode.com/svn/trunk/replicator
  https://tungsten-replicator.googlecode.com/svn/trunk/replicator-extra
  https://bristlecone.svn.sourceforge.net/svnroot/bristlecone/trunk/bristlecone
SVN Revisions:
  commons: Revision: 269 
  fsm: Revision: 269 
  replicator: Revision: 269 
  replicator-extra: Revision: 269 
  bristlecone: Revision: 105 

Original comment by g.maxia on 1 Jul 2011 at 7:35

GoogleCodeExporter commented 9 years ago
Here's the root cause of this problem.  When the log is empty we start at the 
current position on the master.  Normally when operating a master we put a 
heartbeat event into the server so that something is written into the log.  
This is used for failover and works fine for master/slave topologies. 

However, in direct mode there are two problems: 

1.) There is confusion in the code about which DBMS should get the heartbeat.  
This is because we have two DBMS's and the heartbeat command unfortunately 
picks the slave.  That's a FAIL. 

2.) Adding insult to injury, we never even call the heartbeat command in the 
first place.  It is only called when starting a pipeline that is in the master 
role.  

One possible solution is for extractors to insert the heartbeat as they know 
where it goes.  However that has problems--we should only do this if we have an 
extractor that is really reading from a database.  So it is a conundrum that 
requires a little thought to avoid creating another mess.  
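
For reference, on a conventional master the heartbeat that seeds the log can also be forced by hand (standard trepctl usage, shown only to illustrate the mechanism; per points 1 and 2 above it is never issued for a --direct pipeline, and would pick the wrong DBMS if it were):

$ trepctl heartbeat   # inserts a heartbeat event, so the THL gets its first record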

Original comment by berkeley...@gmail.com on 28 Jul 2011 at 1:32

GoogleCodeExporter commented 9 years ago
Regarding comment #6, please notice that we can't insert heartbeat events in 
--direct mode, as there is no master.

A workaround that I use to overcome this issue is manually adding an event to 
the master ("DROP TABLE IF EXISTS mysql.non_existing_table").
Can we do something similar?
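
In shell form the workaround amounts to issuing a harmless statement on the master while the replicator is online (the $MASTER variable follows the earlier transcript):

$ mysql -h $MASTER -e 'DROP TABLE IF EXISTS mysql.non_existing_table'

Once that event lands in the THL, a later offline/online cycle resumes from the recorded position, as described in the clarification above.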

Original comment by g.maxia on 28 Jul 2011 at 3:46

GoogleCodeExporter commented 9 years ago
A possible workaround: 
Tungsten can create a DUD event that will be inserted into the THL when the 
service is created.
For example:
DROP TABLE IF EXISTS mysql.dummy_workaround_for_issue_136

Original comment by g.maxia on 8 Aug 2011 at 8:42

GoogleCodeExporter commented 9 years ago
This needs a long-term fix.  We are removing it from a scheduled version at 
this point until we get time to do a refit of the modeling used in replicator 
pipelines. 

Original comment by robert.h...@continuent.com on 15 Jan 2013 at 4:54

GoogleCodeExporter commented 9 years ago
Hi,

I'm not exactly sure that I have this issue, but when I got an alert that Tungsten had broken, I checked the status and found a problem with processing a log file: "Unable to prepare plugin: class name=com.continuent.tungsten.replicator.thl.THL message=[Found invalid log file header; log must be purged up to this file to open: /opt/installs/cookbook/thl/db1/thl.data.0000001060]"

ls -l /opt/installs/cookbook/thl/db1/thl.data.0000001060
-rw-r----- 1 root adm 0 Jun 22 06:40 /opt/installs/cookbook/thl/db1/thl.data.0000001060

It seems some empty log files were generated, from thl.data.0000001060 to 
thl.data.0000001643. I can restore replication by deleting the empty ones, but 
I would like to know what would be lost if I delete them, and also what caused 
so many logs to be generated.
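
A sketch of that cleanup (my assumption: the replicator should be offline while the THL files are touched; the path is from the listing above):

$ trepctl offline
$ find /opt/installs/cookbook/thl/db1 -name 'thl.data.*' -size 0          # list the empty files first
$ find /opt/installs/cookbook/thl/db1 -name 'thl.data.*' -size 0 -delete  # then remove them
$ trepctl online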

My Setup:

3 nodes: Node1, Node2, Node3 in multi-master replication within a subnet; 
usually the appliedLatency is below 1.

I would like to hear your comments on the points below.

1) Can I delete the 0-byte files (empty logs) and restore Tungsten?
2) What caused so many empty log files to be generated?

Thanks,
Swaroop.

Original comment by swaroopk...@gmail.com on 24 Jun 2014 at 7:23