[UPDATE op] In order to enable updates in the target table, you need to add the --writetime-column regular_column parameter to your run command, e.g.:
./cqlreplicator --state run --tiles 2 --writetime-column reg_column --landing-zone s3://cql-replicator-1234567890-us-east-1 --src-keyspace ks1 --src-table tbl1 --trg-keyspace ks1 --trg-table tbl1 --region us-east-1
Note: a regular column is a column that is not part of the primary key and that gets updated; currently cqlreplicator doesn't support multiple regular columns. Before re-running, execute cleanup and init again.
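For reference, a minimal re-run sequence might look like the sketch below. The bucket, keyspace, and table names are the placeholders from the example above, and the exact flags accepted by the cleanup and init states are assumed to mirror the stats invocation shown later in this thread:

```
# Remove artifacts left by the previous replication run
./cqlreplicator --state cleanup --landing-zone s3://cql-replicator-1234567890-us-east-1 \
  --src-keyspace ks1 --src-table tbl1 --region us-east-1

# Re-initialize, then restart replication with the writetime column
./cqlreplicator --state init --landing-zone s3://cql-replicator-1234567890-us-east-1 \
  --src-keyspace ks1 --src-table tbl1 --region us-east-1
./cqlreplicator --state run --tiles 2 --writetime-column reg_column \
  --landing-zone s3://cql-replicator-1234567890-us-east-1 \
  --src-keyspace ks1 --src-table tbl1 --trg-keyspace ks1 --trg-table tbl1 --region us-east-1
```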
[DELETE op] Currently cqlreplicator doesn't support DELETEs, but this feature will be available in Feb 2024.
Thank you for the detailed response. We have multiple regular columns that the service/app updates frequently; any suggestions or options we can try?
Do you have one or more columns in your source table that are constantly updated, e.g., status or dt_updated? If you do, use one of them to replicate updates: --writetime-column dt_updated or --writetime-column status. cqlreplicator updates all columns in your target table when your writetime-column changes.
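To check whether a candidate column's write time actually advances whenever the application updates the row, you can inspect it in cqlsh. The table, key, and column names below are illustrative; WRITETIME() returns the cell's write timestamp in microseconds since the epoch:

```
-- Run before and after an application update and compare the two values
SELECT pk, WRITETIME(dt_updated) FROM ks1.tbl1 WHERE pk = 1;
```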
I am testing the --writetime-column option, and it looks like the discovery job ran OK, but the replicator job errored out with the following error:
2024-01-26 02:03:30,080 ERROR [spark-listener-group-shared] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis]
{
  "Event": "GlueExceptionAnalysisTaskFailed",
  "Timestamp": 1706234610074,
  "Failure Reason": "Error decoding JSON value for adventure: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: ",
  "Stack Trace": [
    {
      "Declaring Class": "com.sun.proxy.$Proxy71",
      "Method Name": "execute",
      "File Name": null,
      "Line Number": -1
    },
I also see the proxy File Name is null in the stack trace.
Is that the binary column? Could you share your masked table schema?
Here is the table schema from the source:
cqlsh> describe table personalcatalog.my_movies;
CREATE TABLE personalcatalog.my_movies (
accountid text PRIMARY KEY,
a_bucket blob,
action blob,
adventure blob,
animation blob,
avod blob,
b_bucket blob,
c_bucket blob,
comedy blob,
crime_thrillers blob,
d_bucket blob,
documentary blob,
drama blob,
e_bucket blob,
f_bucket blob,
fantasy blob,
g_bucket blob,
greater_than_60_pct blob,
greater_than_80_pct blob,
h_bucket blob,
historical blob,
horror blob,
i_bucket blob,
independent blob,
international blob,
j_bucket blob,
k_bucket blob,
kids_family blob,
l_bucket blob,
m_bucket blob,
modification_time timestamp,
music_musical blob,
n_bucket blob,
numeric_bucket blob,
o_bucket blob,
other blob,
p_bucket blob,
purchase_time_bucket_1 text,
purchase_time_bucket_10 text,
purchase_time_bucket_2 text,
purchase_time_bucket_3 text,
purchase_time_bucket_4 text,
purchase_time_bucket_5 text,
purchase_time_bucket_6 text,
purchase_time_bucket_7 text,
purchase_time_bucket_8 text,
purchase_time_bucket_9 text,
q_bucket blob,
r_bucket blob,
reality blob,
rentals text,
romance blob,
s_bucket blob,
sci_fi blob,
special_char_bucket blob,
sports blob,
superheroes blob,
suspense blob,
t_bucket blob,
total_count int,
u_bucket blob,
v_bucket blob,
w_bucket blob,
western blob,
x_bucket blob,
y_bucket blob,
z_bucket blob
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
I reproduced the error in cqlsh when the blob value is an empty string:
localhost@cqlsh:ks> create table test1 (pk int PRIMARY KEY, bin blob);
localhost@cqlsh:ks> insert into test1 JSON '{"pk": 1, "bin":""}';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Error decoding JSON value for bin: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: "
An empty blob must carry the 0x prefix, e.g., insert into test1 JSON '{"pk": 1, "bin":"0x"}'.
I think the problem is in the Cassandra JSON serializer, which emits an empty string instead of "0x" for an empty blob:

```
select json * from test1;

 [json]
----------------------
 {"pk": 1, "bin": ""}
```
@frozensky Added support for DELETEs and fixed the JSON serialization issue with "0x" empty blob columns. Could you please validate?
1/ Run --state cleanup
2/ Clone the git repository
3/ Run --state run
4/ Validate with --state stats
Thank you for the quick turnaround. I see the replicator running now; I am going to replicate more tables and run some tests for the updates and deletes.
./cqlreplicator --state run --tiles 2 --writetime-column modification_time --landing-zone s3://hnt-cqlreplicator-stg --region us-west-1 --src-keyspace personalcatalog --src-table my_movies --trg-keyspace stg_hnt_personalcatalog --trg-table my_movies --inc-traffic
·······································································
[2024-01-27T16:37:43-08:00] OS: Darwin
[2024-01-27T16:37:43-08:00] Incremental traffic for the historical workload is enabled
[2024-01-27T16:37:43-08:00] Incremental period: 240 seconds
[2024-01-27T16:37:43-08:00] Starting discovery process...
[2024-01-27T16:37:43-08:00] TILES: 2
[2024-01-27T16:37:43-08:00] SOURCE: personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] TARGET: stg_hnt_personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] LANDING ZONE: s3://hnt-cqlreplicator-stg
[2024-01-27T16:37:43-08:00] WRITE TIME COLUMN: modification_time
[2024-01-27T16:37:43-08:00] TTL COLUMN: None
[2024-01-27T16:37:43-08:00] ROWS PER DPU: 250000
[2024-01-27T16:37:43-08:00] Checking if the discovery job is already running...
{
"DeleteMarker": true,
"VersionId": "null"
}
[2024-01-27T16:37:45-08:00] Starting the discovery job...
[2024-01-27T16:39:26-08:00] Average primary keys per tile is 127117
{
"DeleteMarker": true,
"VersionId": "null"
}
{
"DeleteMarker": true,
"VersionId": "null"
}
Starting Glue Jobs : [||||||||||||||||||||--------------------] 50.00%
Starting Glue Jobs : [||||||||||||||||||||||||||||||||||||||||] 100.00% - COMPLETED
[2024-01-27T16:47:34-08:00] Started jobs: jr_18cdce4b61f938848e2d005db7c103d044045990d2d2c9fe5285f4b7831b9b91 jr_3edc5591f56a7b64aff12f06e2e805d82d855f4e497956027656c7f63c3eeea5 jr_e945898cbe3b6bd5aa43b4b3f78371379d53b779dbf1f0c4ef6d09ff96b7d72b
$ ./cqlreplicator --state stats --landing-zone s3://hnt-cqlreplicator-stg --src-keyspace personalcatalog --src-table my_movies
·······································································
[2024-01-27T16:47:57-08:00] OS: Darwin
[2024-01-27T16:48:05-08:00] Discovered rows in personalcatalog.my_movies is 254274
[2024-01-27T16:48:05-08:00] Replicated rows in ks_test_cql_replicator.test_cql_replicator is 0
From the source Cassandra:
cqlsh> SELECT * FROM personalcatalog.my_movies WHERE accountid='2772378';
accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1 | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
2772378 | null | 0x01 | 0x02 | 0x | 0x01 | null | null | 0x | 0x | null | 0x | 0x02 | null | null | 0x | null | 0x07 | 0x07 | null | 0x | 0x | null | 0x | 0x | null | null | 0x | null | null | 2023-10-27 11:23:01.461000+0000 | 0x | null | null | null | 0x | null | ["150674","136243","136100"] | null | null | null | null | null | null | null | null | null | null | null | 0x | [] | 0x | null | 0x04 | null | 0x | 0x | 0x01 | null | 3 | null | null | null | 0x | null | null | null
(1 rows)
From AWS Keyspaces:
cqlsh> SELECT * FROM stg_hnt_personalcatalog.my_movies WHERE accountid='2772378';
accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1 | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
2772378 | null | 0x01 | 0x02 | 0x | 0x01 | null | null | 0x | 0x | null | 0x | 0x02 | null | null | 0x | null | 0x07 | 0x07 | null | 0x | 0x | null | 0x | 0x | null | null | 0x | null | null | 2023-10-27 11:23:01.461000+0000 | 0x | null | null | null | 0x | null | ["150674","136243","136100"] | null | null | null | null | null | null | null | null | null | null | null | 0x | [] | 0x | null | 0x04 | null | 0x | 0x | 0x01 | null | 3 | null | null | null | 0x | null | null | null
(1 rows)
Confirmed that UPDATEs and DELETEs are being replicated; testing more tables.
I am testing out the tool to migrate data from a datacenter Cassandra to AWS Keyspaces. I only see inserts of new rows being replicated into Keyspaces; both updates and deletes don't get replicated.
My test cases (a CQL sketch of these operations follows below):
1/ Insert new row - replicated
2/ Update row - not replicated
3/ Delete row - not replicated
4/ Stop cqlreplicator, insert a row and update the same row (2 steps), start cqlreplicator - both replicated
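For reproducibility, here is a sketch of the CQL behind these test cases against the my_movies table from the schema above; the accountid value and the columns touched are illustrative:

```
-- 1/ Insert a new row
INSERT INTO personalcatalog.my_movies (accountid, total_count) VALUES ('test-001', 1);
-- 2/ Update the row (a regular, non-key column)
UPDATE personalcatalog.my_movies SET total_count = 2 WHERE accountid = 'test-001';
-- 3/ Delete the row
DELETE FROM personalcatalog.my_movies WHERE accountid = 'test-001';
```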