Allow filtering non-PK timestamp column

aws-samples / cql-replicator

CQLReplicator is a migration tool that helps you to replicate data from Cassandra to AWS Services

Apache License 2.0

16 stars 8 forks source link

Allow filtering non-PK timestamp column #166

Open jlewis-spotnana opened 3 weeks ago

jlewis-spotnana commented 3 weeks ago

Is your feature request related to a problem? Please describe. We have a large dataset in AWS Keyspaces. We wish to migrate only data created in the last 60 days, but client-side timestamps are disabled so we cannot use the --start-replication-from feature of cqlreplicator.

We have a timestamp column is our table and wish to filter based on that timestamp column. The filterExpression feature cannot be used since our timestamp column is not part of the Primary Key.

What other options are available to us in this case? Would it be easy to modify cqlreplicator to support such filtering? Our source and target are both AWS Keyspaces.

Describe the solution you'd like We'd like to filter the dataset based on a timestamp column which is not part of the Primary Key.

Describe alternatives you've considered Copy all data using cqlreplicator, then DELETE unwanted data post-migration using a spark script.

Additional context none

nwheeler81 commented 3 weeks ago

@jlewis-spotnana here is an example, what it might look like:

--json-mapping '{
    "keyspaces": {
      "transformation": {        
      "enabled": true,
      "addNonPrimaryKeyColumns": ["timestamp_column"],        
      "filterExpression": "timestamp_column > to_timestamp('2023-05-01 00:00:00')" 
      }
    }
  }'

jlewis-spotnana commented 3 weeks ago

Thanks! I will give this a try and report my results.

EDIT:

@nwheeler81 I do not see addNonPrimaryKeyColumns in the cqlreplicator code base. When supplying this JSON mapping, what is actually happening? How does it work?

EDIT 2:

If you're proposing this as a possible not-yet-implemented solution, then yes i think that syntax would work fine. We need to filter by a cassandra 'timestamp' data type.

How long would it take to implement such a feature?

nwheeler81 commented 3 weeks ago

@jlewis-spotnana, this is just my proposal, well, might take a couple of weeks before I will release it

jlewis-spotnana commented 3 weeks ago

@nwheeler81 Thanks for your consideration of this feature request! I'll probably have to continue my PROD migration w/o this feature, due to timeline constraints on my end.

Let me know if you need a tester. 🙂

nwheeler81 commented 2 weeks ago

@jlewis-spotnana ack, btw, I will release CR this week and you can try the filtering option.

nwheeler81 commented 1 week ago

@jlewis-spotnana here is a branch.

There is a branch with several major changes:

A filter feature for non-primary keys has been added. For example, you can now use: --json-mapping '{"keyspaces": {"transformation": { "enabled": true, "addNonPrimaryKeyColumns": ["col2"], "filterExpression": "col2 > timestamp \"2024-11-12 00:00:59.588000+0000\"" }}}'
Read and write traffic against Amazon Keyspaces can be controlled using --max-wcu-traffic 1000 with or without --tiles, but still it works in the old way. I would recommend to read the updated README.MD
Auto-recovery has been implemented. If Glue jobs are killed, there's no need for manual fixes. The CQLReplicator will automatically detect and resolve the issue.
Performance and cost efficiency have been improved by using unlogged batch. However, for large partitions/rows, disabling batching and using single inserts may be preferable to avoid throttling issues in Amazon Keyspaces. This can be achieved with: --json-mapping '{"keyspaces": {"writeConfiguration": { "maxStatementsPerBatch": 1}}}'