Heterogenous master replicator fails with pt-online-schema-change

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. Configure a master for heterogenous replication (MySQL-to-Hadoop) 
2. load the employees test database  (http://launchpad.net/test-db)
3. make sure it is replicated
4. run pt-online-schema-change, with some transactions running in parallel

pt-online-schema-change --user tungsten --password secret --host 127.0.0.1 
--port 3306 --alter 'add column TS1 timestamp' --execute D=employees,t=salaries 
 > schema_change.log 2>&1 &

sleep 10

mysql -e 'insert into employees.salaries (emp_no,salary,from_date, to_date) 
values (499999, 10000,"2009-09-01", "2012-07-19")'
mysql -e 'insert into employees.salaries (emp_no,salary,from_date, to_date) 
values (499998, 12000, "2009-09-02", "2012-07-18")'
mysql -e 'update  employees.salaries set to_date="2001-10-23" where emp_no = 
19999 and salary = 82375'
mysql -e 'delete from  employees.salaries where emp_no = 19999 and salary = 
77775'

What is the expected output?

The data should be transferred to the slave without problems

What do you see instead?

The master fails

trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : NONE
appliedLastSeqno       : -1
appliedLatency         : -1.0
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : -1
clusterName            : batch
currentEventId         : NONE
currentTimeMillis      : 1400820800441
dataServerHost         : ubuntu1
extensions             :
host                   : ubuntu1
latestEpochNumber      : -1
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://ubuntu1:2112/
maximumStoredSeqNo     : -1
minimumStoredSeqNo     : -1
offlineRequests        : NONE
pendingError           : Stage task failed: stage=binlog-to-q seqno=271 fragno=0
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000004:0000000094743633;-1
pendingErrorSeqno      : 271
pendingExceptionMessage: Unable to find column metadata; table may be missing: 
schema=employees table=_salaries_new
pipelineSource         : UNKNOWN
relativeLatency        : -1.0
resourcePrecedence     : 99
rmiPort                : 10000
role                   : master
seqnoType              : java.lang.Long
serviceName            : batch
serviceType            : unknown
simpleServiceName      : batch
siteName               : default
sourceId               : ubuntu1
state                  : OFFLINE:ERROR
timeInStateSeconds     : 17.947
transitioningTo        :
uptimeSeconds          : 24.511
useSSLConnection       : false
version                : Tungsten Replicator 3.0.0 build 69
Finished status command...

What is the possible cause?

...

What is the proposed solution?

...

Additional information

log file from the master:

INFO   | jvm 3    | 2014/05/23 06:52:58 | 2014-05-23 04:52:58,595 [ - 
WrapperSimpleAppMain] INFO  management.ReplicationServiceManager internal/local 
replication service 'batch' started successfully
INFO   | jvm 3    | 2014/05/23 06:52:58 | 2014-05-23 04:52:58,596 [batch - 
pool-2-thread-1] INFO  management.OpenReplicatorManager Sent State Change 
Notification GOING-ONLINE:SYNCHRONIZING -> ONLINE
INFO   | jvm 3    | 2014/05/23 06:52:59 | 2014-05-23 04:52:59,910 [batch - 
ConnectorHandler: initializing] INFO  thl.ConnectorHandler Log epoch numbers 
checked and match: client source ID=hadoop1 seqno=214 epoch number=188
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,250 [batch - 
binlog-to-q-0] ERROR pipeline.SingleThreadStageTask Stage task failed: 
stage=binlog-to-q seqno=271 fragno=0
INFO   | jvm 3    | 2014/05/23 06:53:00 | 
com.continuent.tungsten.replicator.ReplicatorException: Unable to find column 
metadata; table may be missing: schema=employees table=_salaries_new
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.filter.ColumnNameFilter.getColumnInformation(
ColumnNameFilter.java:275)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.filter.ColumnNameFilter.filter(ColumnNameFilt
er.java:159)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.runTask(Single
ThreadStageTask.java:381)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.run(SingleThre
adStageTask.java:179)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
java.lang.Thread.run(Thread.java:724)
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,252 [batch - 
binlog-to-q-0] INFO  pipeline.SingleThreadStageTask Performing emergency 
rollback of applied changes
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,252 [batch - 
binlog-to-q-0] INFO  pipeline.SingleThreadStageTask Dispatching error event: 
Stage task failed: stage=binlog-to-q seqno=271 fragno=0
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] ERROR management.OpenReplicatorManager Received error 
notification, shutting down services :
INFO   | jvm 3    | 2014/05/23 06:53:00 | Stage task failed: stage=binlog-to-q 
seqno=271 fragno=0
INFO   | jvm 3    | 2014/05/23 06:53:00 | 
com.continuent.tungsten.replicator.ReplicatorException: Unable to find column 
metadata; table may be missing: schema=employees table=_salaries_new
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.filter.ColumnNameFilter.getColumnInformation(
ColumnNameFilter.java:275)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.filter.ColumnNameFilter.filter(ColumnNameFilt
er.java:159)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.runTask(Single
ThreadStageTask.java:381)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
com.continuent.tungsten.replicator.pipeline.SingleThreadStageTask.run(SingleThre
adStageTask.java:179)
INFO   | jvm 3    | 2014/05/23 06:53:00 |       at 
java.lang.Thread.run(Thread.java:724)
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] WARN  management.OpenReplicatorManager Performing emergency 
service shutdown
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] INFO  pipeline.Pipeline Shutting down pipeline: master
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
binlog-to-q-0] INFO  pipeline.SingleThreadStageTask Terminating processing for 
stage task thread
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
binlog-to-q-0] INFO  pipeline.SingleThreadStageTask Last successfully processed 
event prior to termination: seqno=271 
eventid=mysql-bin.000004:0000000094743633;-1
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
binlog-to-q-0] INFO  pipeline.SingleThreadStageTask Task event count: 1
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
q-to-thl-0] WARN  pipeline.SingleThreadStageTask Received unexpected interrupt 
in stage task: q-to-thl
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
q-to-thl-0] INFO  pipeline.SingleThreadStageTask Terminating processing for 
stage task thread
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
q-to-thl-0] INFO  pipeline.SingleThreadStageTask Task event count: 0
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] INFO  pipeline.Pipeline Releasing pipeline: master
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] INFO  pipeline.StageTaskGroup Releasing tasks for stage: 
binlog-to-q
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,255 [batch - 
pool-2-thread-1] INFO  extractor.ExtractorWrapper Releasing raw extractor and 
heartbeat filter
INFO   | jvm 3    | 2014/05/23 06:53:00 | 2014-05-23 04:53:00,256 [batch - 
pool-2-thread-1] INFO  pipeline.StageTaskGroup Releasing tasks for stage: 
q-to-thl

Original issue reported on code.google.com by g.maxia on 23 May 2014 at 5:03

GoogleCodeExporter commented 9 years ago

A note on reproducing this case.  There is a general problem with the 
ColumnNameFilter (and probably PrimaryKeyFilter), namely that they fail if the 
table being looked up cannot be found.  

I had trouble reproducing this error until I turned off replication (e.g., 
issue 'trepctl offline') before running pt-online-schema-change, then ran 
'trepctl online' once everything finished.  If you do this you will easily get 
the failure, because when you turn on the replicator it will not be able to 
find _salaries_new since it has been cleaned up.

Original comment by robert.h...@continuent.com on 29 May 2014 at 10:22

GoogleCodeExporter commented 9 years ago

This bug is now corrected.  The implementation is as follows: 

1. The colnames filter has been extended to include a boolean field 
ignoreMissingTables.  This is set true in the .tpl file for the filter as well 
as the class, as shown by the following example: 

# Column name filter.  Adds column name metadata to row updates.  This is
# required for MySQL row replication if you have logic that requires column
# names.
replicator.filter.colnames=com.continuent.tungsten.replicator.filter.ColumnNameF
ilter
# Heterogeneous topologies need signed/unsigned information.
replicator.filter.colnames.addSignedFlag=true
# If true, ignore tables that have no metadata and generate automatic column
# names. This can happen when reading an old DBMS log in which case current
# metadata is not accessible if tables have sinc been deleted. If false,
# the filter will stop with an exception.
replicator.filter.colnames.ignoreMissingTables=true

When this parameter is true, the filter generates dummy names of the form 
col_1, col_2, ..., col_N for any change that has missing fields. 

2. The enumtostring filter now avoids making changes if it cannot find table 
metadata. 

You can prove that the problems are corrected by setting up a Tungsten master 
enabled for heterogeneous replication and running the following script on the 
master host: 

#!/bin/bash
# Reproduce Issue 924. 
set -x
trepctl offline
mysql -uroot <<SQL
drop table if exists test.no_schema;
create table test.no_schema (id int primary key, data1 varchar(20));
insert into test.no_schema values(1, 'one');
update test.no_schema set data1 = 'two' where id=1;
delete from test.no_schema where id=1;
drop table if exists test.no_schema;
SQL
trepctl online

Original comment by robert.h...@continuent.com on 30 May 2014 at 7:11

Changed state: QA

GoogleCodeExporter commented 9 years ago

Initial tests show that this fix is working fine. We no longer get a trip up of 
the replicator when the structure changes, and the updated CSV files match the 
updated structure as follows.

Original comment by mc.br...@continuent.com on 2 Jun 2014 at 10:48

GoogleCodeExporter commented 9 years ago

Confirmed the fix using two different methods (the initial one used when 
reporting the issue and the one suggested by Robert).
The issue is solved. A permanent test is needed.

Original comment by g.maxia on 19 Sep 2014 at 3:16

Changed state: Documenting
Added labels: Test-Needed

GoogleCodeExporter commented 9 years ago

The documentation has been updated, and an entry has been added to the 3.0 
release notes.

Original comment by mc.br...@continuent.com on 13 Oct 2014 at 10:49

Changed state: Fixed

epermana / tungsten-replicator

Heterogenous master replicator fails with pt-online-schema-change #924