confluentinc / kafka-connect-jdbc

Kafka Connect connector for JDBC-compatible databases

JDBC Source Connector | Timestamp Mode | Where clause incorrect | confluent-oss-5.0.0-2.11 #640

Open aliasbadwolf opened 5 years ago

aliasbadwolf commented 5 years ago

Confluent Platform Type/Version: confluent-oss-5.0.0-2.11

I am running a JDBC source connector (against MySQL as well as Oracle) in timestamp mode, and the WHERE condition the connector appends is incorrect: it makes both the start and end timestamps exclusive. Records committed at exactly a boundary timestamp are therefore never sourced into the topics, since neither boundary is inclusive.

The table I am testing locally is a date dimension holding dates in various formats from 1900-01-01 00:00:00 to 2030-12-31 23:59:59. I initially added just 1000 rows to the table and started the connector with the config below:

{
    "name": "xyz_datetime_dim",
    "config": {
        "connector.class": "JdbcSourceConnector",
        "connection.url": "jdbc:mysql://127.0.0.1:3306?user=root&password=admin123&useSSL=false&useUnicode=true&serverTimezone=UTC",
        "tasks.max": "1",
        "connection.user": "root",
        "connection.password": "admin123",
        "connection.attempts": 3,
        "connection.backoff.ms": 10000,
        "schema.pattern": "xyz",
        "table.whitelist": "datetime_dim",
        "table.blacklist": "",
        "numeric.mapping": "best_fit",
        "mode": "timestamp",
        "incrementing.column.name": "",
        "timestamp.column.name": "full_date",
        "validate.non.null": false,
        "query": "",
        "topic.prefix": "xyz-",
        "table.types": "TABLE",
        "poll.interval.ms": 10000,
        "batch.max.rows": 10,
        "table.poll.interval.ms": 60000,
        "timestamp.delay.interval.ms": 0
    }
}

I have updated the offset for the connector to start at "1900-01-01 00:00:00", which in epoch milliseconds is "-2208988800000":

Start offset value for connector in connect-offsets topic: ["xyz_datetime_dim",{"protocol":"1","table":"xyz.datetime_dim"}] |> {"timestamp_nanos":0,"timestamp":-2208988800000}
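That epoch value can be double-checked with a short script (a standalone sketch, not connector code):

```python
from datetime import datetime, timezone

# 1900-01-01 00:00:00 UTC as epoch milliseconds, matching the
# "timestamp" field written to the connect-offsets topic above.
start = datetime(1900, 1, 1, tzinfo=timezone.utc)
epoch_ms = int(start.timestamp() * 1000)
print(epoch_ms)  # -2208988800000
```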

The connector ran fine, but only 999 records got loaded, which is expected given the erroneous WHERE clause in the query that gets built. Here is the query:

[2019-04-11 15:03:21,074] DEBUG TimestampIncrementingTableQuerier{table="xyz"."datetime_dim", query='null', topicPrefix='xyz-', incrementingColumn='', timestampColumns=[full_date]} prepared SQL query: SELECT * FROM `xyz`.`datetime_dim` **WHERE `xyz`.`datetime_dim`.`full_date` > ? AND `xyz`.`datetime_dim`.`full_date` < ?** ORDER BY `xyz`.`datetime_dim`.`full_date` ASC (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)

I can start the connector 1 second (or 1 millisecond) earlier than the epoch time I actually want it to start at, but eventually records will be committed at exactly the timestamp being used as the start or end boundary, and once that happens those records will NOT be sourced in.

For example: the CURRENT_TIMESTAMP (used as end time) that Querier got from DB was:

[2019-04-11 15:03:21,074] DEBUG executing query SELECT CURRENT_TIMESTAMP to get current time from database (io.confluent.connect.jdbc.dialect.MySqlDatabaseDialect)
[2019-04-11 15:03:21,075] DEBUG Executing prepared statement with timestamp value = 1902-09-27 00:00:00.000 end time = 2019-04-11 15:03:21.000 (io.confluent.connect.jdbc.source.TimestampIncrementingCriteria)

If any commits happened at that exact timestamp value, or if the DB does not even have millisecond or nanosecond precision, those records will be omitted.
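The boundary problem can be reproduced without a database. The sketch below (hypothetical helper names, not the connector's actual code) applies the generated predicate `> begin AND < end` versus the expected `>= begin AND < end` to a row committed exactly at the stored offset:

```python
def exclusive_window(rows, begin, end):
    # Predicate the connector currently builds: both bounds exclusive.
    return [r for r in rows if begin < r < end]

def half_open_window(rows, begin, end):
    # Expected predicate: lower bound inclusive, upper bound exclusive.
    return [r for r in rows if begin <= r < end]

begin = -2208988800000        # stored offset: 1900-01-01 00:00:00 UTC
end = 1554994401000           # CURRENT_TIMESTAMP at poll time (example)
rows = [begin, begin + 1000]  # one row committed exactly at the offset

print(exclusive_window(rows, begin, end))  # the row at the boundary is lost
print(half_open_window(rows, begin, end))  # both rows returned
```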

Here is the output (last record) after reading the topic:

null |> {"date_cd":263140,"date_key":{"long":999},"date_type":{"string":"Normal"},"full_date":{"long":-2122761600000},"day_num_of_week":{"int":6},"day_num_of_month":{"int":26},"day_num_of_quarter":{"int":88},"day_num_of_year":{"int":269},"day_num_absolute":{"int":999},"day_of_week_name":{"string":"FRIDAY"},"day_of_week_abbreviation":{"string":"FRI"},"julian_day_num_of_year":{"long":1902269},"julian_day_num_absolute":{"long":1902000999},"is_weekday":{"string":"Y"},"is_us_civil_holiday":{"string":"N"},"is_last_day_of_week":{"string":"N"},"is_last_day_of_month":{"string":"N"},"is_last_day_of_quarter":{"string":"N"},"is_last_day_of_year":{"string":"N"},"is_last_day_of_fiscal_month":{"string":"N"},"is_last_day_of_fiscal_quarter":{"string":"N"},"is_last_day_of_fiscal_year":{"string":"N"},"prev_day_date":null,"prev_day_date_key":{"long":0},"same_weekday_year_ago_date":null,"same_weekday_year_ago_date_key":{"long":0},"week_of_year_begin_date":{"long":-2123193600000},"week_of_year_begin_date_key":{"long":994},"week_of_year_end_date":{"long":-2122675200000},"week_of_year_end_date_key":{"long":1000},"week_of_month_begin_date":{"long":-2123107200000},"week_of_month_begin_date_key":{"long":995},"week_of_month_end_date":{"long":-2122588800000},"week_of_month_end_date_key":{"long":1001},"week_of_quarter_begin_date":{"long":-2123020800000},"week_of_quarter_begin_date_key":{"long":996},"week_of_quarter_end_date":{"long":-2122502400000},"week_of_quarter_end_date_key":{"long":1002},"week_num_of_month":{"int":4},"week_num_of_quarter":{"int":13},"week_num_of_year":{"int":39},"week_num_overall":{"int":0},"month_num_of_year":{"int":9},"month_num_overall":{"int":0},"month_name":{"string":"SEPTEMBER"},"month_name_abbreviation":{"string":"SEP"},"month_begin_date":{"long":-2124921600000},"month_begin_date_key":{"long":974},"month_end_date":{"long":-2122416000000},"month_end_date_key":{"long":1003},"quarter_num_of_year":{"int":3},"quarter_num_overall":{"int":11},"quarter_begin_date":{"long":-2
130278400000},"quarter_begin_date_key":{"long":912},"quarter_end_date":{"long":-2122416000000},"quarter_end_date_key":{"long":1003},"year_num":{"int":1902},"year_begin_date":{"long":-2145916800000},"year_begin_date_key":{"long":731},"year_end_date":{"long":-2114467200000},"year_end_date_key":{"long":1095},"yyyymm":{"string":"190209"},"yyyymmdd":{"string":"19020926"},"ddmonyy":{"string":""},"ddmonyyyy":{"string":"26SEP1902"}}
^CProcessed a total of 999 messages

Ideally, the query with the correct WHERE clause should have been:

SELECT * FROM `xyz`.`datetime_dim` WHERE `xyz`.`datetime_dim`.`full_date` >**=** ? AND `xyz`.`datetime_dim`.`full_date` < ? ORDER BY `xyz`.`datetime_dim`.`full_date` ASC 
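With that half-open predicate, successive polls tile the timeline without gaps or duplicates, because each poll's end time becomes the next poll's begin time. A minimal simulation (assumed poll boundaries for illustration, not connector internals):

```python
def poll(rows, begin, end):
    # Half-open window: >= begin AND < end, as the corrected WHERE clause would select.
    return [r for r in rows if begin <= r < end]

rows = [0, 1000, 2000, 2000, 3000]  # commit timestamps; two rows share one instant
polls = [(0, 2000), (2000, 4000)]   # each poll's end becomes the next poll's begin

seen = []
for begin, end in polls:
    seen.extend(poll(rows, begin, end))

print(seen)  # every row sourced exactly once, including those on the boundary
```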

I believe this should be resolved as a major bug.

aliasbadwolf commented 5 years ago

@rmoff Not sure if this is the correct forum/place to open a bug. Also, the same issue is present in 5.1.2 as well.