aws-samples / cql-replicator

CQLReplicator is a migration tool that helps you to replicate data from Cassandra to AWS Services
Apache License 2.0

[CQLReplicator on Glue] Only inserts of new rows are replicated; updates and deletes are not replicated #79

Closed frozensky closed 7 months ago

frozensky commented 7 months ago

I am testing out the tool to migrate data from a datacenter Cassandra cluster to AWS Keyspaces. I only see inserts of new rows being replicated into Keyspaces; updates and deletes don't get replicated.

My test cases:

1. Insert new row - replicated
2. Update row - not replicated
3. Delete row - not replicated
4. Stop cqlreplicator, insert a row and update the same row (2 steps), start cqlreplicator - both replicated

nwheeler81 commented 7 months ago

[UPDATE op] In order to enable updates in the target table, you need to add the --writetime-column regular_column parameter to your run command, e.g.:

./cqlreplicator --state run --tiles 2 --writetime-column reg_column --landing-zone s3://cql-replicator-1234567890-us-east-1 --src-keyspace ks1 --src-table tbl1 --trg-keyspace ks1 --trg-table tbl1 --region us-east-1

Note: the regular column is not part of the primary key and should be the one that gets updated; currently cqlreplicator doesn't support multiple regular columns. Before re-running, run cleanup and execute init again.

[DELETE op] Currently cqlreplicator doesn't support DELETEs, but this feature will be available in Feb 2024.

frozensky commented 7 months ago

Thank you for the detailed response. We have multiple regular columns that the service/app updates frequently; any suggestions or options that we can try?

nwheeler81 commented 7 months ago

Do you have one or more columns in your source table that are constantly updated, e.g., status, dt_updated, etc.? If you do, use one of them to replicate updates: --writetime-column dt_updated or --writetime-column status. cqlreplicator updates all columns in your target table if your writetime-column is changed.
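The detection mechanism described above can be sketched roughly as follows. This is a simplified Python illustration of the idea (compare each row's writetime-column timestamp between discovery passes), not CQLReplicator's actual code; the data shapes are hypothetical:

```python
# Simplified sketch of writetime-based update detection (hypothetical,
# not the actual CQLReplicator implementation).

def rows_to_replicate(previous, current):
    """Return primary keys whose writetime-column timestamp changed.

    `previous` and `current` map primary key -> WRITETIME(writetime_column)
    as observed on the source table during two discovery passes.
    """
    changed = []
    for pk, ts in current.items():
        if pk not in previous or previous[pk] != ts:
            changed.append(pk)  # new row, or the writetime column was updated
    return changed

# Example: pk 2's tracked column was rewritten, pk 3 is brand new.
prev = {1: 1706234610074000, 2: 1706234610074000}
curr = {1: 1706234610074000, 2: 1706234999999000, 3: 1706235000000000}
print(rows_to_replicate(prev, curr))  # [2, 3]
```

This is why a frequently updated column such as dt_updated works: any application write that touches it bumps its cell timestamp, which triggers replication of the whole row.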

frozensky commented 7 months ago

I am testing --writetime-column and it looks like the discovery job ran OK, but the replicator job errors out with the following error:

2024-01-26 02:03:30,080 ERROR [spark-listener-group-shared] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] 
{
    "Event": "GlueExceptionAnalysisTaskFailed",
    "Timestamp": 1706234610074,
    "Failure Reason": "Error decoding JSON value for adventure: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: ",
    "Stack Trace": [
        {
            "Declaring Class": "com.sun.proxy.$Proxy71",
            "Method Name": "execute",
            "File Name": null,
            "Line Number": -1
        },

I see the proxy file name is null.

nwheeler81 commented 7 months ago

Is that a binary column? Could you share your masked table schema?

frozensky commented 7 months ago

Here is the table schema from the source:

cqlsh> describe table personalcatalog.my_movies;

CREATE TABLE personalcatalog.my_movies (
    accountid text PRIMARY KEY,
    a_bucket blob,
    action blob,
    adventure blob,
    animation blob,
    avod blob,
    b_bucket blob,
    c_bucket blob,
    comedy blob,
    crime_thrillers blob,
    d_bucket blob,
    documentary blob,
    drama blob,
    e_bucket blob,
    f_bucket blob,
    fantasy blob,
    g_bucket blob,
    greater_than_60_pct blob,
    greater_than_80_pct blob,
    h_bucket blob,
    historical blob,
    horror blob,
    i_bucket blob,
    independent blob,
    international blob,
    j_bucket blob,
    k_bucket blob,
    kids_family blob,
    l_bucket blob,
    m_bucket blob,
    modification_time timestamp,
    music_musical blob,
    n_bucket blob,
    numeric_bucket blob,
    o_bucket blob,
    other blob,
    p_bucket blob,
    purchase_time_bucket_1 text,
    purchase_time_bucket_10 text,
    purchase_time_bucket_2 text,
    purchase_time_bucket_3 text,
    purchase_time_bucket_4 text,
    purchase_time_bucket_5 text,
    purchase_time_bucket_6 text,
    purchase_time_bucket_7 text,
    purchase_time_bucket_8 text,
    purchase_time_bucket_9 text,
    q_bucket blob,
    r_bucket blob,
    reality blob,
    rentals text,
    romance blob,
    s_bucket blob,
    sci_fi blob,
    special_char_bucket blob,
    sports blob,
    superheroes blob,
    suspense blob,
    t_bucket blob,
    total_count int,
    u_bucket blob,
    v_bucket blob,
    w_bucket blob,
    western blob,
    x_bucket blob,
    y_bucket blob,
    z_bucket blob
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99PERCENTILE';
nwheeler81 commented 7 months ago

I reproduced the error in CQLSH if the blob value is empty:

localhost@cqlsh:ks> create table test1 (pk int PRIMARY KEY, bin blob);
localhost@cqlsh:ks> insert into test1 JSON '{"pk": 1, "bin":""}';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Error decoding JSON value for bin: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: "

An empty value in a binary column should look like this: insert into test1 JSON '{"pk": 1, "bin":"0x"}'

I think the problem is in the Cassandra JSON serializer:

localhost@cqlsh:ks> select json * from test1;

 [json]
----------------------
 {"pk": 1, "bin": ""}
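The fix amounts to always emitting the mandatory "0x" prefix when serializing blob values to JSON, so that an empty blob becomes "0x" rather than "". A minimal Python sketch of that encoding rule (illustrative only, not the project's actual code):

```python
# Sketch of the blob-to-JSON encoding fix. Cassandra's JSON format requires
# blob values to be hex strings with a "0x" prefix, so an empty blob must be
# serialized as "0x", never as "".

def blob_to_json_value(blob: bytes) -> str:
    """Encode a CQL blob as a JSON-safe hex string with the mandatory 0x prefix."""
    return "0x" + blob.hex()

print(blob_to_json_value(b""))      # 0x
print(blob_to_json_value(b"\x01"))  # 0x01
print(blob_to_json_value(b"\x07"))  # 0x07
```

With this rule, the empty blob columns in the my_movies table (e.g. the 0x values shown later in the thread) round-trip through JSON without the "missing 0x prefix" error.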

nwheeler81 commented 7 months ago

@frozensky Added support for DELETEs and fixed the JSON serialization issue with empty ("0x") blob columns. Could you please validate:

1. Run --state cleanup
2. Clone the git repository
3. Run --state run
4. Validate with --state stats
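Beyond --state stats, one way to spot-check correctness is to fetch the same primary key from both clusters and diff the columns. The helper below is hypothetical (not part of cqlreplicator) and assumes rows have already been fetched as dicts, e.g. via a CQL driver's dict row factory:

```python
# Hypothetical validation helper (not part of CQLReplicator): given the same
# row fetched from the source Cassandra cluster and from Amazon Keyspaces,
# report any columns whose values differ.

def diff_row(source_row: dict, target_row: dict) -> dict:
    """Return {column: (source_value, target_value)} for every mismatch."""
    columns = set(source_row) | set(target_row)
    return {
        col: (source_row.get(col), target_row.get(col))
        for col in columns
        if source_row.get(col) != target_row.get(col)
    }

src = {"accountid": "2772378", "adventure": b"\x02", "total_count": 3}
trg = {"accountid": "2772378", "adventure": b"\x02", "total_count": 2}
print(diff_row(src, trg))  # {'total_count': (3, 2)}
```

An empty result means the row replicated byte-for-byte, which is effectively what the side-by-side SELECTs later in this thread verify by eye.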

frozensky commented 7 months ago

Thank you for the quick turnaround. I see the replicator running now; I am going to replicate more tables and run some tests for the updates and deletes.

./cqlreplicator --state run --tiles 2 --writetime-column modification_time --landing-zone s3://hnt-cqlreplicator-stg --region us-west-1 --src-keyspace personalcatalog --src-table my_movies --trg-keyspace stg_hnt_personalcatalog --trg-table my_movies --inc-traffic
·······································································
[2024-01-27T16:37:43-08:00] OS: Darwin
[2024-01-27T16:37:43-08:00] Incremental traffic for the historical workload is enabled
[2024-01-27T16:37:43-08:00] Incremental period: 240 seconds
[2024-01-27T16:37:43-08:00] Starting discovery process...
[2024-01-27T16:37:43-08:00] TILES: 2
[2024-01-27T16:37:43-08:00] SOURCE: personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] TARGET: stg_hnt_personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] LANDING ZONE: s3://hnt-cqlreplicator-stg
[2024-01-27T16:37:43-08:00] WRITE TIME COLUMN: modification_time
[2024-01-27T16:37:43-08:00] TTL COLUMN: None
[2024-01-27T16:37:43-08:00] ROWS PER DPU: 250000
[2024-01-27T16:37:43-08:00] Checking if the discovery job is already running...
{

    "DeleteMarker": true,
    "VersionId": "null"
}
[2024-01-27T16:37:45-08:00] Starting the discovery job...
[2024-01-27T16:39:26-08:00] Average primary keys per tile is 127117
{
    "DeleteMarker": true,
    "VersionId": "null"
}
Starting Glue Jobs : [||||||||||||||||||||--------------------] 50.00%
{
    "DeleteMarker": true,
    "VersionId": "null"
}
Starting Glue Jobs : [||||||||||||||||||||||||||||||||||||||||] 100.00% - COMPLETED
[2024-01-27T16:47:34-08:00] Started jobs: jr_18cdce4b61f938848e2d005db7c103d044045990d2d2c9fe5285f4b7831b9b91 jr_3edc5591f56a7b64aff12f06e2e805d82d855f4e497956027656c7f63c3eeea5 jr_e945898cbe3b6bd5aa43b4b3f78371379d53b779dbf1f0c4ef6d09ff96b7d72b
$ ./cqlreplicator --state stats --landing-zone s3://hnt-cqlreplicator-stg --src-keyspace personalcatalog --src-table my_movies
·······································································
[2024-01-27T16:47:57-08:00] OS: Darwin
[2024-01-27T16:48:05-08:00] Discovered rows in personalcatalog.my_movies is 254274
[2024-01-27T16:48:05-08:00] Replicated rows in ks_test_cql_replicator.test_cql_replicator is 0

From the source Cassandra:

cqlsh> SELECT * FROM personalcatalog.my_movies WHERE accountid='2772378';

 accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time               | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1       | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
   2772378 |     null |   0x01 |      0x02 |        0x | 0x01 |     null |     null |     0x |              0x |     null |          0x |  0x02 |     null |     null |      0x |     null |                0x07 |                0x07 |     null |         0x |     0x |     null |          0x |            0x |     null |     null |          0x |     null |     null | 2023-10-27 11:23:01.461000+0000 |            0x |     null |           null |     null |    0x |     null | ["150674","136243","136100"] |                    null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |     null |     null |      0x |      [] |      0x |     null |   0x04 |                null |     0x |          0x |     0x01 |     null |           3 |     null |     null |     null |      0x |     null |     null |     null

(1 rows)

From AWS Keyspaces:

cqlsh> SELECT * FROM stg_hnt_personalcatalog.my_movies WHERE accountid='2772378';

 accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time               | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1       | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
   2772378 |     null |   0x01 |      0x02 |        0x | 0x01 |     null |     null |     0x |              0x |     null |          0x |  0x02 |     null |     null |      0x |     null |                0x07 |                0x07 |     null |         0x |     0x |     null |          0x |            0x |     null |     null |          0x |     null |     null | 2023-10-27 11:23:01.461000+0000 |            0x |     null |           null |     null |    0x |     null | ["150674","136243","136100"] |                    null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |     null |     null |      0x |      [] |      0x |     null |   0x04 |                null |     0x |          0x |     0x01 |     null |           3 |     null |     null |     null |      0x |     null |     null |     null

(1 rows)
frozensky commented 7 months ago

Confirmed that UPDATEs and DELETEs are being replicated; testing more tables.