Enable replicators to track schema changes using filters and force commits between commands that arrive on replicas

GoogleCodeExporter commented 9 years ago

1. To which tool/application/daemon will this feature apply?

Tungsten Replicator

2. Describe the feature in general

When loading a data warehouse such as Hadoop or Vertica we need the ability to 
force a commit when table schema changes.  In particular changing table columns 
invalidates the currently open CSV file and will cause an error.  We want to 
force a commit and clear the table definition to cure this. 

3. Describe the feature interface

Replicators will support two new filters to help with schema change management. 

schemachange - A Java filter that looks for operations on schemas (e.g. CREATE 
SCHEMA) and tables (e.g., ALTER TABLE or DROP TABLE).  It annotates the 
containing transaction metadata with the flag "schema_change" and adds metadata 
to DDL statements to identify schema, table, and operation type.   This is best 
executed on the master at the time transactions are generated. 

monitorschemachange - A Javascript filter to log such schema changes and 
optionally at the 'force_commit' flag to transactions to ensure an immediate 
commit.  This is best executed on the applier on a slave. 

4. Give an idea (if applicable) of a possible implementation

Filters will be implemented as described above. 

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

It will enable batch loading to handle schema changes without forcing planned 
downtime or somehow coordinating between the upstream extractor and downstream 
appliers. 

5b. What hardship will the human race have to endure if this feature is
implemented.

Schema change introduces a number of complex problems.  This feature is the 
beginning of a series of improvements to grapple with it on a large scale in a 
way that decouples transaction producers and consumers.  It is not the last 
improvement in this area.  

6. Notes

This feature depends on Issue 858.

Original issue reported on code.google.com by robert.h...@continuent.com on 14 Apr 2014 at 5:36

GoogleCodeExporter commented 9 years ago

This feature is now implemented.  Here is a description including notes on 
testing.  

1. Filter implementations.  This feature adds two filters as described in the 
feature description. 

schemachange - A Java filter to detect schema changes by parsing statements and 
create permanent annotations on the statements. This filter also detects 
truncate statements. Here is the general behavior.

a.) Any statement that affects the logical schema of the table, such as 
CREATE/DROP SCHEMA, CREATE/DROP TABLE, or ALTER causes the annotation 
##schema_change to be added to the event metadata.

b.) Similarly any statement that is a table TRUNCATE operate causes the 
annotation ##truncate to be added to the event metadata. 

c.) In both cases the operation type, schema name, and table name (if 
appropriate) are added to the statement metadata.

monitorschemachange - A Javascript filter to log changes detected by the 
schemachange filter and optionally force a commit when such a change is 
processed.  

2. Installation.  Filters can be installed anywhere that is convenient with the 
caveat that schemachange must run before monitorschemachange for the latter to 
work.  Here is a sample installation.  

a.) Install schemachange on master as in the following example.  This way the 
schema annotations are visible in the log. 

tools/tpm install batch1 \
...
  --svc-extractor-filters=colnames,pkey,schemachange \
...
  --start-and-report

b.) Install monitorschemachange on the slave applier stage.  This logs the 
changes on the slave and causes a commit only in the final stage.  

tools/tpm install batch1 \
...
  --svc-applier-filters=monitorschemachange \
...
  --start-and-report

3. To test schemachange filter operation, enter the following types of 
statements: 

CREATE SCHEMA bar
DROP SCHEMA bar
CREATE TABLE foo
DROP TABLE foo
ALTER TABLE foo ADD COLUMN data1 varchar(30)
ALTER TABLE foo DROP COLUMN data1

All of these should generate annotations on transactions stored in the THL like 
the following example for ALTER TABLE.  Note the 'schema_change' annotation on 
the transaction and the ##operation, ##schema, and ##table on the statement. 

SEQ# = 2 / FRAG# = 0 (last frag)
- TIME = 2014-05-05 21:57:24.0
- EPOCH# = 0
- EVENTID = mysql-bin.000287:0000000000000639;70
- SOURCEID = logos1
- METADATA = 
[mysql_server_id=1;unsafe_for_block_commit;dbms_type=mysql;service=batch;shard=t
est;schema_change]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = UTF-8, autocommit = 1, sql_auto_is_null = 0, 
foreign_key_checks = 1, unique_checks = 1, sql_mode = 'NO_ENGINE_SUBSTITUTION', 
character_set_client = 33, collation_connection = 33, collation_server = 33, 
##schema = test, ##table = foo, ##operation = ALTER TABLE]
- SCHEMA = 
- SQL(0) = alter table test.foo add column data1 varchar(30)

You can also enter a truncate table command, which should generate similar 
annotations: 

SEQ# = 3 / FRAG# = 0 (last frag)
- TIME = 2014-05-05 21:59:35.0
- EPOCH# = 0
- EVENTID = mysql-bin.000287:0000000000000719;71
- SOURCEID = logos1
- METADATA = 
[mysql_server_id=1;dbms_type=mysql;service=batch;shard=test;truncate]
- TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
- OPTIONS = [##charset = UTF-8, autocommit = 1, sql_auto_is_null = 0, 
foreign_key_checks = 1, unique_checks = 1, sql_mode = 'NO_ENGINE_SUBSTITUTION', 
character_set_client = 33, collation_connection = 33, collation_server = 33, 
##schema = test, ##table = foo, ##operation = TRUNCATE]
- SCHEMA = 
- SQL(0) = truncate test.foo

4.) To test monitorschemachange operation, look for the following: 

a.) The slave replicator log should have messages like the following for schema 
change and truncate commands: 

INFO   | jvm 1    | 2014/05/05 22:01:06 | 2014-05-06 05:01:06,422 [batch - 
q-to-dbms-0] INFO  filter.JavaScriptFilter SCHEMA CHANGE: seqno=2 schema=test 
table=foo operation=ALTER TABLE sql=[alter table test.foo add column data1 
varchar(30)]

The default behavior of the monitorschemachange filter is to force a schema 
change, which results in the following message: 

INFO   | jvm 1    | 2014/05/05 22:09:30 | 2014-05-06 05:09:30,551 [batch - 
q-to-dbms-0] INFO  filter.JavaScriptFilter Forcing commit due to schema change: 
seqno=2

Original comment by robert.h...@continuent.com on 6 May 2014 at 5:13

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r2219.

Original comment by robert.h...@continuent.com on 9 May 2014 at 12:31

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 9 May 2014 at 1:47

Changed state: QA

GoogleCodeExporter commented 9 years ago

To be more useful this implementation should include the following capability: 

1.) For each schema change that may require data to be refreshed from the 
source, generate a file in a designated directory.  This file should have the 
following properties. 

a.) The name has the form 'schema.table'
b.) It is in a subdirectory that matches the replicator service name. 

2.) The file is generated for truncate, drop/create table.  It may need to be 
generated for alter as well.  (This should be parameterizable?)

For Hadoop implementations we can then write a script that checks these files 
and re-sqoops source data automatically, after which the file can be cleared.

Original comment by robert.h...@continuent.com on 16 May 2014 at 3:39

Changed state: Started

GoogleCodeExporter commented 9 years ago

The Oracle applier needs to be updated to at least replicate TRUNCATE 
statements. Ideally any DDL could be replicated. There could be an option to 
enable that portion.

Original comment by jeff.m...@continuent.com on 16 May 2014 at 1:06

GoogleCodeExporter commented 9 years ago

It might be useful if this basic structure was adaptable to other database 
targets. 

For example, the content of the file could be a CSV defining schema/table and 
change type: 

sales.invoices,alter
sales,alter

This information could then be read and/or imported into a destination database 
to track for changes. 

For example, with Vertica/InfiniDB we could use and parse this information to 
more easily identify when a ddlscan update is required.

Original comment by mc.br...@continuent.com on 16 May 2014 at 1:42

GoogleCodeExporter commented 9 years ago

The schemachange and monitorschemachange filters are adapted as follows to 
catch changes from pt-online-schema-change and generate suitable notifications. 
 Here is a summary of the behavior. 

1. The schemachange now generates THL annotations when there is a RENAME TABLE 
command.  They appear as follows.  This allows us to pick up the "schema 
change" that occurs when a pt-online-schema-change completes successfully.  

SEQ# = 20 / FRAG# = 0 (last frag)
...
- OPTIONS = [##charset = ISO8859_1, autocommit = 1, sql_auto_is_null = 0, 
foreign_key_checks = 1, unique_checks = 1, sql_mode = 
'NO_ENGINE_SUBSTITUTION,NO_AUTO_VALUE_ON_ZERO', character_set_client = 8, 
collation_connection = 8, collation_server = 33, ##schema = test, ##table = 
foo, ##operation = RENAME TABLE]
- SCHEMA = test
- SQL(0) = RENAME TABLE `test`.`foo` TO `test`.`_foo_old`, `test`.`_foo_new` TO 
`test`.`foo`

2. The monitorschemachange filter has a notify property that enable generating 
of notification files.  To enable this, add the following property to tpm 
commands: 

  --property=replicator.filter.monitorschemachange.notify=true

When set to true, the replicator generates notification files of the form 
'schema.table' for table changes and 'schema.-none-' for schema changes (e.g. 
drop/create).  Here is an example of the file generated for a rename: 

# Schema change notification
sourceId=logos1
seqno=112896
commitTimestamp=2014-05-31 17:42:52.0
schema=test
table=schema_change
operation=RENAME TABLE
sql=RENAME TABLE `test`.`schema_change` TO `test`.`_schema_change_old`, `test`.`
_schema_change_new` TO `test`.`schema_change`

3. Schema change files are generated by default in directory 
tungsten-replicator/log/schemachanges/<service>.  You can change this using the 
notifyDir property, as in the following example: 

 --property=replicator.filter.monitorschemachange.notifyDir=/tmp/myschemachanges

To test these features just enable both schemachange and monitorschemafilters. 
You can see the pt-online-schema-change behavior using a script like the 
following: 

mysql -uroot <<SQL
drop table if exists test.schema_change;
create table test.schema_change (id int primary key, data1 varchar(20));
insert into test.schema_change values(1, 'one');
SQL
pt-online-schema-change --user tungsten --password secret --alter 'add column 
ts1 timestamp' --execute D=test,t=schema_change

Original comment by robert.h...@continuent.com on 31 May 2014 at 6:04

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 31 May 2014 at 6:11

Changed state: QA

GoogleCodeExporter commented 9 years ago

This issue is just the beginning of efforts to track schema changes accurately. 
 The current implementation solves one specific use case, namely to provide a 
reasonably accurate indication when it is necessary to resqoop data to Hadoop 
following a schema change in an upstream MySQL server.  

We can add additional improvements for other DBMS types as well as for Hadoop 
loading based on field experience and specific use cases.

Original comment by robert.h...@continuent.com on 31 May 2014 at 6:14

GoogleCodeExporter commented 9 years ago

In the low-hanging-fruit department, would it be a good idea to add the seqNo 
to the notify file name?

The reasoning for this addition is that we process CSV files using files made 
of table name + seqNo. If the notify files did include a seqNo, both these 
files could be used for automated operations outside of the existing tools.

Would such addition clash with what we are using so far?

Original comment by g.maxia on 17 Jun 2014 at 12:45

GoogleCodeExporter commented 9 years ago

In fact, if we don't add a seqNo to the table name, every DDL operation on such 
table will overwrite the previous operation. 

CREATE TABLE sentinel ...
ALTER TABLE sentinel ...
TRUNCATE sentinel 

The above statements will result in only one notify file, containing the latest 
command.

 Is this behavior intentional?

Original comment by g.maxia on 17 Jun 2014 at 1:11

Changed state: Started

GoogleCodeExporter commented 9 years ago

Adding the seqno is a good idea.  I can do this as part of final cleanup if 
others agree.

Original comment by robert.h...@continuent.com on 28 Aug 2014 at 4:54

GoogleCodeExporter commented 9 years ago

The seqno issue looks useful to field users--I had a call today that discussed 
this.  I have committed a fix so that each generated file has the following 
form: 

<schema>.<table>.<seqno>

This makes it easy for scripts to identify changes quickly.   At this point I 
believe the implementation is complete.

Original comment by robert.h...@continuent.com on 28 Aug 2014 at 11:58

Changed state: QA

GoogleCodeExporter commented 9 years ago

The implementation is complete.
This feature behaves as described. 
There is no explicit test for it yet.

Original comment by g.maxia on 18 Sep 2014 at 11:34

Changed state: Documenting
Added labels: Test-Needed

GoogleCodeExporter commented 9 years ago

A note has been added to the 3.0 release notes

Original comment by mc.br...@continuent.com on 20 Oct 2014 at 1:29

Changed state: Fixed

epermana / tungsten-replicator

Enable replicators to track schema changes using filters and force commits between commands that arrive on replicas #868