aws-samples / cql-replicator

CQLReplicator is a migration tool that helps you to replicate data from Cassandra to AWS Services
Apache License 2.0

[CQLReplicator on Glue] Only inserts of new rows are replicated; updates and deletes are not #79

Closed frozensky closed 9 months ago

frozensky commented 10 months ago

I am testing out the tool to migrate data from our datacenter Cassandra to Amazon Keyspaces. I only see inserts of new rows being replicated into Keyspaces; updates and deletes are not replicated.

My test cases:
1. Insert a new row: replicated.
2. Update a row: not replicated.
3. Delete a row: not replicated.
4. Stop CQLReplicator, insert a row and then update the same row (two steps), start CQLReplicator: both changes replicated.

nwheeler81 commented 10 months ago

[UPDATE op] To enable updates in the target table, you need to add the --writetime-column regular_column parameter to your run command, e.g., ./cqlreplicator --state run --tiles 2 --writetime-column reg_column --landing-zone s3://cql-replicator-1234567890-us-east-1 --src-keyspace ks1 --src-table tbl1 --trg-keyspace ks1 --trg-table tbl1 --region us-east-1. Note: the regular column must not be part of the primary key and should be a column that gets updated; CQLReplicator currently doesn't support multiple regular columns. Before re-running, run cleanup and execute init again. [DELETE op] CQLReplicator currently doesn't support DELETEs, but this feature will be available in Feb 2024.

frozensky commented 10 months ago

Thank you for the detailed response. We have multiple regular columns that the service/app updates frequently; any suggestions or options we can try?

nwheeler81 commented 10 months ago

Do you have one or more columns in your source table that are constantly updated, e.g., status, dt_updated, etc.? If you do, use one of them to replicate updates: --writetime-column dt_updated or --writetime-column status. CQLReplicator updates all columns in the target row when the writetime of your writetime-column changes.
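The idea above can be sketched in a few lines of Python. This is a hypothetical illustration of writetime-based change detection, not CQLReplicator's actual code: Cassandra exposes a per-cell WRITETIME, and a replicator can compare the last-seen writetime of the chosen column against the current one to decide which rows need to be re-replicated.

```python
# Hypothetical sketch of writetime-based change detection.
# Not CQLReplicator's actual implementation.

def detect_changed_rows(cached, current):
    """Return primary keys whose writetime advanced since the last scan.

    cached/current map primary key -> WRITETIME(writetime_column),
    in microseconds, as Cassandra reports it.
    """
    changed = []
    for pk, wt in current.items():
        if cached.get(pk) != wt:      # new row, or the writetime column was rewritten
            changed.append(pk)
    return changed

cached = {"2772378": 1698405781461000}
current = {
    "2772378": 1698492181461000,   # modification_time rewritten -> replicate update
    "9999999": 1698405781461000,   # unseen key -> replicate as insert
}
print(detect_changed_rows(cached, current))  # -> ['2772378', '9999999']
```

This is also why a single frequently-updated column (like modification_time) is enough: any application write that touches it bumps its writetime, which flags the whole row for replication.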

frozensky commented 10 months ago

I am testing --writetime-column and it looks like the discovery job ran OK, but the replicator job errors out with the following:

2024-01-26 02:03:30,080 ERROR [spark-listener-group-shared] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] 
{
    "Event": "GlueExceptionAnalysisTaskFailed",
    "Timestamp": 1706234610074,
    "Failure Reason": "Error decoding JSON value for adventure: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: ",
    "Stack Trace": [
        {
            "Declaring Class": "com.sun.proxy.$Proxy71",
            "Method Name": "execute",
            "File Name": null,
            "Line Number": -1
        },

I see the proxy file name is null.
nwheeler81 commented 10 months ago

Is that a binary column? Could you share your masked table schema?

frozensky commented 10 months ago

Here is the table schema from the source:

cqlsh> describe table personalcatalog.my_movies;

CREATE TABLE personalcatalog.my_movies (
    accountid text PRIMARY KEY,
    a_bucket blob,
    action blob,
    adventure blob,
    animation blob,
    avod blob,
    b_bucket blob,
    c_bucket blob,
    comedy blob,
    crime_thrillers blob,
    d_bucket blob,
    documentary blob,
    drama blob,
    e_bucket blob,
    f_bucket blob,
    fantasy blob,
    g_bucket blob,
    greater_than_60_pct blob,
    greater_than_80_pct blob,
    h_bucket blob,
    historical blob,
    horror blob,
    i_bucket blob,
    independent blob,
    international blob,
    j_bucket blob,
    k_bucket blob,
    kids_family blob,
    l_bucket blob,
    m_bucket blob,
    modification_time timestamp,
    music_musical blob,
    n_bucket blob,
    numeric_bucket blob,
    o_bucket blob,
    other blob,
    p_bucket blob,
    purchase_time_bucket_1 text,
    purchase_time_bucket_10 text,
    purchase_time_bucket_2 text,
    purchase_time_bucket_3 text,
    purchase_time_bucket_4 text,
    purchase_time_bucket_5 text,
    purchase_time_bucket_6 text,
    purchase_time_bucket_7 text,
    purchase_time_bucket_8 text,
    purchase_time_bucket_9 text,
    q_bucket blob,
    r_bucket blob,
    reality blob,
    rentals text,
    romance blob,
    s_bucket blob,
    sci_fi blob,
    special_char_bucket blob,
    sports blob,
    superheroes blob,
    suspense blob,
    t_bucket blob,
    total_count int,
    u_bucket blob,
    v_bucket blob,
    w_bucket blob,
    western blob,
    x_bucket blob,
    y_bucket blob,
    z_bucket blob
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99PERCENTILE';
nwheeler81 commented 10 months ago

I reproduced the error in cqlsh when the blob value is an empty string:

localhost@cqlsh:ks> create table test1 (pk int PRIMARY KEY, bin blob);
localhost@cqlsh:ks> insert into test1 JSON '{"pk": 1, "bin":""}';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Error decoding JSON value for bin: Value '' is not a valid blob representation: String representation of blob is missing 0x prefix: "

An empty value in a blob column should be written like this: insert into test1 JSON '{"pk": 1, "bin":"0x"}'

I think the problem is in the Cassandra JSON serializer:

select json from test1;

 [json]
----------------------
 {"pk": 1, "bin": ""}
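The fix described later in this thread amounts to normalizing blob values before building the INSERT ... JSON payload: an empty blob must be rendered as the literal "0x", never as "". A minimal Python sketch of that normalization (the helper name is hypothetical, not from CQLReplicator):

```python
# Hypothetical sketch: serialize Cassandra blob values for INSERT ... JSON.
# An empty bytes value must become "0x", never "", or the server rejects it
# with "String representation of blob is missing 0x prefix".

import json

def blob_to_json_literal(value):
    if value is None:
        return None            # null stays null
    return "0x" + value.hex()  # b"" -> "0x", b"\x01" -> "0x01"

row = {"pk": 1, "bin": blob_to_json_literal(b"")}
print(json.dumps(row))  # {"pk": 1, "bin": "0x"}
```

With this mapping, the failing payload '{"pk": 1, "bin":""}' becomes the accepted '{"pk": 1, "bin": "0x"}'.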

nwheeler81 commented 10 months ago

@frozensky Added support for DELETEs and fixed the JSON serialization issue with empty ("0x") blob columns. Could you please validate:
1. Run --state cleanup
2. Clone the git repository
3. Run --state run
4. Validate with --state stats

frozensky commented 10 months ago

Thank you for the quick turnaround. I see the replicator running now; I am going to replicate more tables and run some tests for updates and deletes.

./cqlreplicator --state run --tiles 2 --writetime-column modification_time --landing-zone s3://hnt-cqlreplicator-stg --region us-west-1 --src-keyspace personalcatalog --src-table my_movies --trg-keyspace stg_hnt_personalcatalog --trg-table my_movies --inc-traffic
·······································································
[2024-01-27T16:37:43-08:00] OS: Darwin
[2024-01-27T16:37:43-08:00] Incremental traffic for the historical workload is enabled
[2024-01-27T16:37:43-08:00] Incremental period: 240 seconds
[2024-01-27T16:37:43-08:00] Starting discovery process...
[2024-01-27T16:37:43-08:00] TILES: 2
[2024-01-27T16:37:43-08:00] SOURCE: personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] TARGET: stg_hnt_personalcatalog.my_movies
[2024-01-27T16:37:43-08:00] LANDING ZONE: s3://hnt-cqlreplicator-stg
[2024-01-27T16:37:43-08:00] WRITE TIME COLUMN: modification_time
[2024-01-27T16:37:43-08:00] TTL COLUMN: None
[2024-01-27T16:37:43-08:00] ROWS PER DPU: 250000
[2024-01-27T16:37:43-08:00] Checking if the discovery job is already running...
{

    "DeleteMarker": true,
    "VersionId": "null"
}
[2024-01-27T16:37:45-08:00] Starting the discovery job...
[2024-01-27T16:39:26-08:00] Average primary keys per tile is 127117
{
    "DeleteMarker": true,
    "VersionId": "null"
}
Starting Glue Jobs : [||||||||||||||||||||--------------------] 50.00%
{
    "DeleteMarker": true,
    "VersionId": "null"
}
Starting Glue Jobs : [||||||||||||||||||||||||||||||||||||||||] 100.00% - COMPLETED
[2024-01-27T16:47:34-08:00] Started jobs: jr_18cdce4b61f938848e2d005db7c103d044045990d2d2c9fe5285f4b7831b9b91 jr_3edc5591f56a7b64aff12f06e2e805d82d855f4e497956027656c7f63c3eeea5 jr_e945898cbe3b6bd5aa43b4b3f78371379d53b779dbf1f0c4ef6d09ff96b7d72b
$ ./cqlreplicator --state stats --landing-zone s3://hnt-cqlreplicator-stg --src-keyspace personalcatalog --src-table my_movies
·······································································
[2024-01-27T16:47:57-08:00] OS: Darwin
[2024-01-27T16:48:05-08:00] Discovered rows in personalcatalog.my_movies is 254274
[2024-01-27T16:48:05-08:00] Replicated rows in ks_test_cql_replicator.test_cql_replicator is 0

From the source Cassandra:

cqlsh> SELECT * FROM personalcatalog.my_movies WHERE accountid='2772378';

 accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time               | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1       | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
   2772378 |     null |   0x01 |      0x02 |        0x | 0x01 |     null |     null |     0x |              0x |     null |          0x |  0x02 |     null |     null |      0x |     null |                0x07 |                0x07 |     null |         0x |     0x |     null |          0x |            0x |     null |     null |          0x |     null |     null | 2023-10-27 11:23:01.461000+0000 |            0x |     null |           null |     null |    0x |     null | ["150674","136243","136100"] |                    null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |     null |     null |      0x |      [] |      0x |     null |   0x04 |                null |     0x |          0x |     0x01 |     null |           3 |     null |     null |     null |      0x |     null |     null |     null

(1 rows)

From Amazon Keyspaces:

cqlsh> SELECT * FROM stg_hnt_personalcatalog.my_movies WHERE accountid='2772378';

 accountid | a_bucket | action | adventure | animation | avod | b_bucket | c_bucket | comedy | crime_thrillers | d_bucket | documentary | drama | e_bucket | f_bucket | fantasy | g_bucket | greater_than_60_pct | greater_than_80_pct | h_bucket | historical | horror | i_bucket | independent | international | j_bucket | k_bucket | kids_family | l_bucket | m_bucket | modification_time               | music_musical | n_bucket | numeric_bucket | o_bucket | other | p_bucket | purchase_time_bucket_1       | purchase_time_bucket_10 | purchase_time_bucket_2 | purchase_time_bucket_3 | purchase_time_bucket_4 | purchase_time_bucket_5 | purchase_time_bucket_6 | purchase_time_bucket_7 | purchase_time_bucket_8 | purchase_time_bucket_9 | q_bucket | r_bucket | reality | rentals | romance | s_bucket | sci_fi | special_char_bucket | sports | superheroes | suspense | t_bucket | total_count | u_bucket | v_bucket | w_bucket | western | x_bucket | y_bucket | z_bucket
-----------+----------+--------+-----------+-----------+------+----------+----------+--------+-----------------+----------+-------------+-------+----------+----------+---------+----------+---------------------+---------------------+----------+------------+--------+----------+-------------+---------------+----------+----------+-------------+----------+----------+---------------------------------+---------------+----------+----------------+----------+-------+----------+------------------------------+-------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------+----------+---------+---------+---------+----------+--------+---------------------+--------+-------------+----------+----------+-------------+----------+----------+----------+---------+----------+----------+----------
   2772378 |     null |   0x01 |      0x02 |        0x | 0x01 |     null |     null |     0x |              0x |     null |          0x |  0x02 |     null |     null |      0x |     null |                0x07 |                0x07 |     null |         0x |     0x |     null |          0x |            0x |     null |     null |          0x |     null |     null | 2023-10-27 11:23:01.461000+0000 |            0x |     null |           null |     null |    0x |     null | ["150674","136243","136100"] |                    null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |                   null |     null |     null |      0x |      [] |      0x |     null |   0x04 |                null |     0x |          0x |     0x01 |     null |           3 |     null |     null |     null |      0x |     null |     null |     null

(1 rows)
frozensky commented 10 months ago

Confirmed that UPDATEs and DELETEs are being replicated. Testing more tables now.