epermana / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

replication to hadoop gets wrong timestamp for DST #1068

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Install replication from MySQL to Hadoop using TR 3.0.1-30
2. On the master, insert a timestamp for a wall-clock time skipped by a DST change

mysql> create table timetest(id int not null primary key, ts timestamp);
Query OK, 0 rows affected (0.01 sec)

mysql> set autocommit=0;
Query OK, 0 rows affected (0.00 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into timetest values (1, '2014-11-19 08:50:11');
Query OK, 1 row affected (0.00 sec)

mysql> insert into timetest values (2, '2014-03-09 02:30:35');
Query OK, 1 row affected (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from timetest;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-11-19 08:50:11 |
|  2 | 2014-03-09 02:30:35 |
+----+---------------------+

3. Check the results on the slave
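The second inserted value, 2014-03-09 02:30:35, falls inside the spring-forward gap in US time zones (clocks jumped from 02:00 straight to 03:00 on that date), so it never occurs on a local wall clock. A minimal sketch of how to detect such a nonexistent time, using `America/New_York` as an assumed example zone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Any zone with a 2014-03-09 spring-forward transition works here.
tz = ZoneInfo("America/New_York")

def exists_on_wall_clock(naive, tz):
    """A wall-clock time exists iff converting to UTC and back preserves it."""
    local = naive.replace(tzinfo=tz)
    return local.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None) == naive

print(exists_on_wall_clock(datetime(2014, 11, 19, 8, 50, 11), tz))  # True
print(exists_on_wall_clock(datetime(2014, 3, 9, 2, 30, 35), tz))    # False
```

The first value survives the round trip; the second lands in the gap, which is exactly the case this issue exercises.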

What is the expected output?

The timestamps are replicated correctly in all three stages of Hadoop data 
transfer (csv file, stage table, final table)

What do you see instead?

The CSV has the right values:
hadoop fs -cat staging/batch/test/timetest/timetest*.csv | perl -pe 's/\x01/ | /g'
I | 2 | 1 | 2014-12-06 14:55:49.000 | 1 | 2014-11-19 08:50:11.000
I | 2 | 2 | 2014-12-06 14:55:49.000 | 2 | 2014-03-09 02:30:35.000
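The staging CSV separates fields with \x01 (Hive's default field delimiter), which is why the perl substitution above is needed to make it readable. A small sketch of splitting one such row; the field layout (opcode, seqno, row id, commit timestamp, primary key, column value) is inferred from the rows above, not taken from a spec:

```python
# One raw staging row as it appears on HDFS, \x01-delimited.
row = "I\x012\x012\x012014-12-06 14:55:49.000\x012\x012014-03-09 02:30:35.000"

# Field meanings are an assumption inferred from the output above.
opcode, seqno, row_id, commit_ts, pkey, ts_value = row.split("\x01")
print(ts_value)  # 2014-03-09 02:30:35.000
```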

# (after running the map-reduce script)
# ./bin/load-reduce-check -U jdbc:mysql:thin://ubuntu1:3306/test -s test -r /opt/qa/tungsten -S batch

The stage table has the right values:
hive> select * from stage_xxx_timetest;
OK
I   2   1   2014-12-06 14:55:49 1   2014-11-19 08:50:11
I   2   2   2014-12-06 14:55:49 2   2014-03-09 02:30:35 
Time taken: 1.132 seconds

The final table has a wrong value:

hive> select * from timetest;
OK
1   2014-11-19 08:50:11
2   2014-03-09 03:30:35 # <===
Time taken: 0.134 seconds
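The one-hour shift in the final table is what a time-zone-aware reinterpretation of the nonexistent wall-clock time produces. A sketch reproducing the exact observed value, assuming a DST-observing zone such as `America/New_York` (this illustrates the arithmetic, not the replicator's or Hive's actual code path):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# 02:30:35 on 2014-03-09 does not exist on these wall clocks:
# clocks jumped from 02:00 directly to 03:00.
local = datetime(2014, 3, 9, 2, 30, 35, tzinfo=tz)

# Round-tripping through an absolute instant, as a TIMESTAMP
# reinterpretation during map/reduce effectively does, pushes the
# nonexistent wall time forward by one hour.
shifted = local.astimezone(timezone.utc).astimezone(tz)
print(shifted.strftime("%Y-%m-%d %H:%M:%S"))  # 2014-03-09 03:30:35
```

This matches the `2014-03-09 03:30:35` value seen in the final table while the CSV and stage table still carry the original string.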

What is the possible cause?

N/A

What is the proposed solution?

N/A

Additional information

Installation command:

./tools/tpm configure defaults \
    --reset \
    --java-file-encoding=UTF8 \
    --java-user-timezone=GMT \
    --skip-validation-check=HostsFileCheck \
    --start-and-report

./tools/tpm configure batch \
    --hosts=ubuntu1 \
    --master=ubuntu1 \
    --members=ubuntu1,hadoop1 \
    --replication-user=tungsten \
    --replication-password=secret \
    --home-directory=$TUNGSTEN_HOME \
    --datasource-mysql-data-directory=/var/lib/mysql \
    --disable-relay-logs=true \
    --datasource-log-pattern=mysql-bin \
    --datasource-mysql-conf=/etc/mysql/my.cnf \
    --mysql-enable-enumtostring=true \
    --mysql-enable-settostring=true \
    --mysql-use-bytes-for-string=false \
    --rmi-port=10100 \
    --svc-extractor-filters=colnames,pkey,schemachange \
    --property=replicator.filter.pkey.addPkeyToInserts=true \
    --property=replicator.filter.pkey.addColumnsToDeletes=true \
    --skip-validation-check=ReplicationServicePipelines

./tools/tpm configure batch \
    --hosts=hadoop1 \
    --master=ubuntu1 \
    --members=ubuntu1,hadoop1 \
    --datasource-type=file \
    --rmi-port=10100 \
    --property=replicator.stage.q-to-dbms.blockCommitInterval=10s \
    --property=replicator.stage.q-to-dbms.blockCommitRowCount=30000 \
    --property=replicator.stage.q-to-dbms.blockCommitPolicy=lax \
    --svc-applier-filters=monitorschemachange \
    --batch-enabled=true \
    --property=replicator.applier.dbms.parallelization=3 \
    --batch-load-template=hadoop \
    --property=replicator.datasource.applier.csvType=hive \
    --property=replicator.applier.dbms.parallelization=1 \
    --install-directory=$TUNGSTEN_HOME \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --skip-validation-check=DirectDatasourceDBPort \
    --skip-validation-check=DatasourceDBPort \
    --metadata-directory=/opt/qa/tungsten/customdir

./tools/tpm install

trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000002:0000000000000770;-1
appliedLastSeqno       : 2
appliedLatency         : 0.099
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : batch
currentEventId         : mysql-bin.000002:0000000000000770
currentTimeMillis      : 1417878178742
dataServerHost         : ubuntu1
extensions             :
host                   : ubuntu1
latestEpochNumber      : 0
masterConnectUri       : thl://localhost:/
masterListenUri        : thl://ubuntu1:2112/
maximumStoredSeqNo     : 2
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : /var/lib/mysql
relativeLatency        : 429.742
resourcePrecedence     : 99
rmiPort                : 10100
role                   : master
seqnoType              : java.lang.Long
serviceName            : batch
serviceType            : local
simpleServiceName      : batch
siteName               : default
sourceId               : ubuntu1
state                  : ONLINE
timeInStateSeconds     : 746.9
timezone               : GMT
transitioningTo        :
uptimeSeconds          : 747.862
useSSLConnection       : false
version                : Tungsten Replicator 3.0.1 build 30
Finished status command...

trepctl -host hadoop1 status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000002:0000000000000770;-1
appliedLastSeqno       : 2
appliedLatency         : 5.772
autoRecoveryEnabled    : false
autoRecoveryTotal      : 0
channels               : 1
clusterName            : batch
currentEventId         : NONE
currentTimeMillis      : 1417878293241
dataServerHost         : hadoop1
extensions             :
host                   : hadoop1
latestEpochNumber      : 0
masterConnectUri       : thl://ubuntu1:2112/
masterListenUri        : null
maximumStoredSeqNo     : 2
minimumStoredSeqNo     : 0
offlineRequests        : NONE
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://ubuntu1:2112/
relativeLatency        : 544.241
resourcePrecedence     : 99
rmiPort                : 10100
role                   : slave
seqnoType              : java.lang.Long
serviceName            : batch
serviceType            : local
simpleServiceName      : batch
siteName               : default
sourceId               : hadoop1
state                  : ONLINE
timeInStateSeconds     : 852.028
timezone               : GMT
transitioningTo        :
uptimeSeconds          : 857.094
useSSLConnection       : false
version                : Tungsten Replicator 3.0.1 build 30
Finished status command...

Original issue reported on code.google.com by g.maxia on 6 Dec 2014 at 3:06

GoogleCodeExporter commented 8 years ago
Great test.  This could be a problem within Hadoop itself, since the replicator 
deposited the value correctly in the staging table.  What happens if you insert 
such a value directly into Hive and then select it back?  Ideally you should 
select using a key, as this will force map/reduce and run the date through the 
full process we are using to generate materialized views.  

(It is possible Hadoop has some of the exact same time issues we have been 
correcting in the replicator.)

Original comment by robert.h...@continuent.com on 6 Dec 2014 at 4:31

GoogleCodeExporter commented 8 years ago
This is an issue with Hive re-interpreting the timestamp value during the 
map/reduce process.

During Map/Reduce we effectively do an INSERT INTO ... SELECT FROM

With Hive, that means we interpret the table data according to the set 
datatype, then insert the interpreted value back into the new table. 

The easiest change is to alter the Velocity templates used by ddlscan and set 
TIMESTAMP columns to be treated as STRING types. This will prevent the interpretation. 

Longer term, we should change the map/reduce to explicitly interpret TIMESTAMP 
values as strings (which side-steps the interpretation), but this needs to be 
considered alongside any changes we make to the supported target databases in 
Hadoop; HBase and Impala probably do not suffer the same problems.
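The proposed short-term fix amounts to changing the MySQL-to-Hive type mapping so TIMESTAMP columns land as STRING, which Hive passes through verbatim during map/reduce. A minimal sketch of such a mapping; the table and helper here are illustrative, not the actual ddlscan Velocity template code:

```python
# Illustrative MySQL-to-Hive type map; the real mapping lives in
# ddlscan's Velocity (.vm) templates, not in Python.
HIVE_TYPE_MAP = {
    "INT": "INT",
    "BIGINT": "BIGINT",
    "VARCHAR": "STRING",
    "TIMESTAMP": "STRING",  # proposed change: was TIMESTAMP
    "DATETIME": "STRING",   # proposed change: was TIMESTAMP
}

def hive_column_type(mysql_type):
    # Hypothetical helper: unknown types fall back to STRING.
    return HIVE_TYPE_MAP.get(mysql_type.upper(), "STRING")

print(hive_column_type("timestamp"))  # STRING
```

Because Hive never parses a STRING column as a zoned instant, the nonexistent wall-clock value survives the INSERT INTO ... SELECT FROM step unchanged.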

Original comment by mc.br...@continuent.com on 9 Dec 2014 at 3:51

GoogleCodeExporter commented 8 years ago

Original comment by linas.vi...@continuent.com on 11 Dec 2014 at 2:43

GoogleCodeExporter commented 8 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18