aws-samples / cql-replicator

CQLReplicator is a migration tool that helps you to replicate data from Cassandra to AWS Services
Apache License 2.0

Replication jobs not stopping gracefully #136

Open mati999q opened 6 months ago

mati999q commented 6 months ago

Describe the bug Sometimes when the job is finished and we want to stop the job, we use the request-stop command provided. At times, this does not stop all the replication workers (and sometimes not even the discovery job itself).

This has cost implications, especially for replication jobs with more workers. The only reliable way to stop the jobs fully is to cancel them, which skews metrics.

To Reproduce Steps to reproduce the behavior:

  1. Run any CQL Replicator job (ideally one with more tiles)
  2. Try to stop the run with the request-stop command

Expected behavior All jobs should stop when a stop is requested.

Screenshots

In this case, the run had 8 replication jobs. 4 of them plus the discovery job stopped correctly. The other 4 would only stop when cancelled, or after numerous retries of the request-stop command (one still did not stop, and some jobs stopped 10 minutes later).

image

image


nwheeler81 commented 6 months ago

Hi @mati999q, thanks for reporting the issue. Could you please provide more information:

1/ Open the Amazon Keyspaces console, click on tables, find your table, and click on the "monitor" tab. Find the "Write units per second" dashboard and set the range to 04/04/24 9:45 - 10:00 your local time (or whenever you issued the stop requests). Do you observe write traffic against the table?

2/ Open the AWS Glue console, click on one of the unstopped Glue jobs, and from "run details" click on "error logs". Look for errors, but disregard: Could not initialize class com.amazonaws.services.glue.util.StringToBoolean$ and Unable to load JNR native implementation. This could be normal if JNR is excluded from the classpath java.lang.NoClassDefFoundError: jnr/posix/POSIXHandler

Note that CQLReplicator stops eventually, not immediately after a stop request, because it may need some time to complete the current task so that snapshots are tracked in order.
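The console check in step 1/ can probably also be done from the terminal. A minimal sketch, assuming the table's write traffic is published to CloudWatch under the AWS/Cassandra namespace as ConsumedWriteCapacityUnits with Keyspace and TableName dimensions; the function name and the example arguments are placeholders, not part of CQLReplicator:

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: metric namespace AWS/Cassandra, metric
# ConsumedWriteCapacityUnits, dimensions Keyspace and TableName.

# Print per-minute sums of consumed write units for one Keyspaces table.
# $1 = keyspace, $2 = table, $3 = start time, $4 = end time (ISO 8601)
write_units_per_minute() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Cassandra \
    --metric-name ConsumedWriteCapacityUnits \
    --dimensions "Name=Keyspace,Value=$1" "Name=TableName,Value=$2" \
    --start-time "$3" --end-time "$4" \
    --period 60 --statistics Sum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Sum]' \
    --output text
}

# Example call (placeholder keyspace, table, and time window):
# write_units_per_minute ZZZ YYY 2024-04-04T09:45:00Z 2024-04-04T10:00:00Z
```

An empty result, or rows whose sum is 0, would suggest the replication jobs were no longer writing during that window.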

mati999q commented 6 months ago

1.) No traffic for around 90 minutes prior to requesting a stop (image)

2.) No errors found in the jobs that fit the time of the stops, though logs still active with new tasks assigned to replication jobs

nwheeler81 commented 6 months ago

@mati999q is it possible to reproduce the issue in your environment again? You can supply the run command with --cr to clean up the ledger. In addition, I've noticed that you use G.2X instead of G.1X. Are you facing OOM issues on the Glue side?

mati999q commented 6 months ago

It is difficult to reliably reproduce this bug, but it definitely happens more often than not (especially with a higher number of workers per replication job). It happened quite often during testing with the 100 GB data set. Since the terminal does not return an error when a job fails to stop, the only way to see it is on the Glue console, and the jobs keep running in the meantime (a high number of workers means more cost). For reference, the job described above was using override-rows-per-worker 10000000 due to the number of rows in the source.

It seems very inconsistent across our numerous runs: most of the time it works and jobs stop in ~1 min, but sometimes it does not and we have to retry many times. It would be nice to get a confirmation in the terminal once a job has stopped, if possible.

We were getting OOM issues with G.1X on the mentioned 100 GB test. It works fine after switching to G.2X, and we have not yet hit any issues that would make us go to G.4X (though we might have to in the future, or increase the number of workers in the discovery job).
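Until the tool prints such a confirmation itself, the Glue job state can be polled from the terminal. A rough sketch, assuming you know the Glue job name (CQLReplicator's naming scheme is not shown in this thread, so it is passed in as a parameter); both function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: the job name is a placeholder you have to supply yourself.

# Print the IDs of runs of one Glue job that are still RUNNING.
running_run_ids() {
  aws glue get-job-runs --job-name "$1" \
    --query "JobRuns[?JobRunState=='RUNNING'].Id" --output text
}

# Poll until no run is RUNNING, or give up after a number of attempts.
# $1 = Glue job name, $2 = max attempts, $3 = seconds between attempts
wait_until_stopped() {
  local i=0
  local ids
  while [ "$i" -lt "$2" ]; do
    ids="$(running_run_ids "$1")"
    if [ -z "$ids" ] || [ "$ids" = "None" ]; then
      echo "stopped: $1"
      return 0
    fi
    echo "still running: $1 ($ids)"
    sleep "$3"
    i=$((i + 1))
  done
  return 1
}

# Example call (placeholder job name): wait_until_stopped my-glue-job 20 30
```

Only the RUNNING state is checked here; a run that is in STOPPING is treated as no longer running, which may or may not be what you want for cost tracking.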

jlewis-spotnana commented 1 month ago

I'm also facing this issue. In my case, I have two workers, but as you can see below the cqlreplicator script only attempts to stop 1 of them.

I have resorted to killing the jobs manually in AWS Glue, but this apparently causes corruption of cqlreplicator's internal state, making it impossible to restart the job without manual intervention.

% ./cqlreplicator --state request-stop --region us-west-2 \
  --landing-zone s3://my-lz-bucket-name \
  --src-keyspace XXX --src-table YYY \
  --trg-keyspace ZZZ --trg-table YYY
    ___ ___  _     ____            _ _           _
  / ___/ _ \| |   |  _ \ ___ _ __ | (_) ___ __ _| |_ ___  _ __
 | |  | | | | |   | |_) / _ \ '_ \| | |/ __/ _` | __/ _ \| '__|
 | |__| |_| | |___|  _ <  __/ |_) | | | (_| (_| | || (_) | |
  \____\__\_\_____|_| \_\___| .__/|_|_|\___\__,_|\__\___/|_|
                            |_|
·······································································
:     __          _______   _____           _____                     :
:    /\ \        / / ____| |  __ \         / ____|                    :
:   /  \ \  /\  / / (___   | |__) | __ ___| (___   ___ _ ____   _____ :
:  / /\ \ \/  \/ / \___ \  |  ___/ '__/ _ \\___ \ / _ \ '__\ \ / / _ \:
: / ____ \  /\  /  ____) | | |   | | | (_) |___) |  __/ |   \ V /  __/:
:/_/    \_\/  \/  |_____/  |_|   |_|  \___/_____/ \___|_|    \_/ \___|:
·······································································
[2024-08-22T10:17:38-07:00] OS: Darwin
[2024-08-22T10:17:38-07:00] AWS CLI: aws-cli/2.15.23 Python/3.11.6 Darwin/23.5.0 exe/x86_64 prompt/off
[2024-08-22T10:17:38-07:00] Requested a stop for the discovery job
[2024-08-22T10:17:39-07:00] Requested a stop for the replication tile: 0
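To spot this from the output without opening the Glue console, the log lines above can be checked against the expected tile count. A small sketch that relies only on the "Requested a stop for the replication tile: N" line format shown above; the function name is hypothetical and tiles are assumed to be numbered from 0:

```shell
#!/usr/bin/env bash
# Sketch only: reads cqlreplicator output on stdin.

# Print every tile in 0..TILES-1 that has no "Requested a stop" line.
# $1 = expected number of tiles
tiles_without_stop_request() {
  local tile=0
  local requested
  # Extract the tile numbers that did receive a stop request.
  requested="$(sed -n 's/.*Requested a stop for the replication tile: \([0-9][0-9]*\).*/\1/p')"
  while [ "$tile" -lt "$1" ]; do
    if ! printf '%s\n' "$requested" | grep -qx "$tile"; then
      echo "$tile"
    fi
    tile=$((tile + 1))
  done
}

# Example: ./cqlreplicator --state request-stop ... | tiles_without_stop_request 2
```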

EDIT: Manually putting a stopRequested file in replication/1/ causes the second tile to exit with SUCCESS.

aws s3 cp stopRequested s3://my-lz-bucket-name/XXX/YYY/replication/1/
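
The same workaround could be looped over every tile. This is only a sketch: it assumes each tile watches replication/<tile>/ under the same prefix as in the command above, that tiles are numbered from 0, and that an empty stopRequested file is enough. The helper names and the DRY_RUN switch are made up here and are not part of cqlreplicator:

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: each tile polls
# <landing-zone>/<src-keyspace>/<src-table>/replication/<tile>/stopRequested

# Build the marker URI for one tile.
# $1 = landing zone, $2 = source keyspace, $3 = source table, $4 = tile
stop_marker_uri() {
  local lz="${1%/}"   # tolerate one trailing slash on the landing zone
  printf '%s/%s/%s/replication/%s/stopRequested\n' "$lz" "$2" "$3" "$4"
}

# Upload a marker for tiles 0..TILES-1; with DRY_RUN=1 only print the commands.
# $1 = landing zone, $2 = source keyspace, $3 = source table, $4 = tile count
request_stop_all_tiles() {
  local tile=0
  local marker=stopRequested
  if [ "${DRY_RUN:-0}" != "1" ]; then
    marker="$(mktemp)"   # assumption: an empty marker file is sufficient
  fi
  while [ "$tile" -lt "$4" ]; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "aws s3 cp $marker $(stop_marker_uri "$1" "$2" "$3" "$tile")"
    else
      aws s3 cp "$marker" "$(stop_marker_uri "$1" "$2" "$3" "$tile")"
    fi
    tile=$((tile + 1))
  done
  if [ "${DRY_RUN:-0}" != "1" ]; then
    rm -f "$marker"
  fi
}

# Example: DRY_RUN=1 request_stop_all_tiles s3://my-lz-bucket-name XXX YYY 2
```

Running it with DRY_RUN=1 first shows exactly which objects would be written before anything touches the bucket.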

nwheeler81 commented 1 month ago

Do you think it is a good idea to add a flag for the recovery process?

Cheers, Nikolai

jlewis-spotnana commented 1 month ago

Do you think it is a good idea to add a flag for the recovery process?

By "recovery process" do you mean copying 'head' to 'tail' as described in https://github.com/aws-samples/cql-replicator/issues/156#issuecomment-2305542717 ? If so, yes that would be useful.