apache / flink-cdc

Flink CDC is a streaming data integration tool
https://nightlies.apache.org/flink/flink-cdc-docs-stable
Apache License 2.0
5.67k stars 1.92k forks source link

[mysql] in SQL mode, local timezone is used in the snapshot phase, but UTC timezone is used in streaming phase by default #1406

Closed kylemeow closed 2 years ago

kylemeow commented 2 years ago

Describe the bug Hi Devs,

I have found that in SQL mode, MySqlSourceOptions#SERVER_TIME_ZONE is set to use UTC by default, which is reasonable as this product might be used by users around the globe.

However, it seems that this timezone setting is only effective in the streaming phase, while in the initial snapshot phase, the local timezone is still used (e.g. UTC+8), causing data inconsistency and puzzles for users.

1658805730907 1658805998265

Environment :

To Reproduce Steps to reproduce the behavior:

  1. Thes test data : Any MySQL tables with Timestamp fields and non-UTC local timezone, and there should be some existing data for the initial snapshot phase.
  2. The test code : Simple ETL case (INSERT INTO sink SELECT * FROM source)
  3. The error : Inconsistent timestamp records written to the sink.

The snapshot phase, local timezone (UTC+8 is used): image

The streaming phase, UTC is used: image

Additional Description Would like to see if the 'table.local-time-zone' in TableConfig could be used to determine the Debezium server timezone 1658805705062

chrofram commented 2 years ago

However, it seems that this timezone setting is only effective in the streaming phase, while in the initial snapshot phase, the local timezone is still used (e.g. UTC+8), causing data inconsistency and puzzles for users.

Boss, is this the opposite?

When I tested it, server-time-zone can only control the time zone during the snapshot, no matter how it is modified, it cannot change the time zone during the increment. And when server-time-zone = UTC is specified, the opposite problem will occur, the time zone during the snapshot is UTC, and the time zone during the increment is still the default system time zone.

kylemeow commented 2 years ago

However, it seems that this timezone setting is only effective in the streaming phase, while in the initial snapshot phase, the local timezone is still used (e.g. UTC+8), causing data inconsistency and puzzles for users.

Boss, is this the opposite?

When I tested it, server-time-zone can only control the time zone during the snapshot, no matter how it is modified, it cannot change the time zone during the increment. And when server-time-zone = UTC is specified, the opposite problem will occur, the time zone during the snapshot is UTC, and the time zone during the increment is still the default system time zone.

Hi @chrofram , thanks for sharing your experiences with this issue. Based on what I have observed, only the streaming phase is affected by the server-time-zone option in my environment, at least for TIMESTAMP fields.

May I know the CREATE TABLE statements for your MySQL table?

chrofram commented 2 years ago

Thanks for your reply, here is my MysqlTable:

CREATE TABLE test ( id bigint(20) NOT NULL AUTO_INCREMENT, updated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, create_date datetime DEFAULT NULL, check_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id) ) ENGINE=InnoDB AUTO_INCREMENT=58009 DEFAULT CHARSET=utf8;

And here is my FlinkCDCconfiguration: 'connector' = 'mysql-cdc', 'hostname' = '', 'port' = '3306', 'username' = '', 'password' = '***', 'database-name' = 'cdctest', 'table-name' = 'test', 'server-time-zone' = 'UTC'

In my tests, datetime types are not affected by timezone, but fields of type timestamp are only affected by the server-time-zone configuration during snapshots. . . When the snapshot ends, subsequent streaming data is still processed according to the default time zone.

I think I probably know what the problem is. I am using JsonDebeziumDeserializationSchema and need to deal with the time-zone correspondence.

kylemeow commented 2 years ago

Thanks for your reply, here is my MysqlTable:

CREATE TABLE test ( id bigint(20) NOT NULL AUTO_INCREMENT, updated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, create_date datetime DEFAULT NULL, check_time datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id) ) ENGINE=InnoDB AUTO_INCREMENT=58009 DEFAULT CHARSET=utf8;

And here is my FlinkCDCconfiguration: 'connector' = 'mysql-cdc', 'hostname' = '', 'port' = '3306', 'username' = '', 'password' = '***', 'database-name' = 'cdctest', 'table-name' = 'test', 'server-time-zone' = 'UTC'

In my tests, datetime types are not affected by timezone, but fields of type timestamp are only affected by the server-time-zone configuration during snapshots. . . When the snapshot ends, subsequent streaming data is still processed according to the default time zone.

I think I probably know what the problem is. I am using JsonDebeziumDeserializationSchema and need to deal with the time-zone correspondence.

Hi @chrofram , sadly I still cannot reproduce your case in my own environment.

As we know, Debezium internally reads Timestamp data in UTC during the streaming phase (like '2022-09-20T08:14:00Z'), and later Flink CDC Connector converts the UTC timestamp string into LocalDateTime via com.ververica.cdc.debezium.utils.TemporalConversions#toLocalDateTime with the user-defined server-time-zone, so maybe you can add a breakpoint at this method and debug to see how the value is transformed.

chrofram commented 2 years ago

@kylemeow ,Thank you so much for your guidance! After my test, when I use flinkCDC native SQL, the situation is as you said; and when I use streamAPI, because I use JsonDebeziumDeserializationSchema instead of RowDataDebeziumDeserializationSchema that native SQL uses by default, and I didn't handle the conversion of various types of time in my code, so the opposite occurs in the question of time zones. Sorry for taking up your time, and thanks again.