Intermittent issue with s3 replication plugin

sushantbhat commented 3 years ago

I noticed that messages in the SQS queue are sometimes not consumed by the JobWorker and stay waiting for several days. When a new item was added to S3, the event was triggered and the message was added to the queue; both the number of messages available and the waiting job count in the CloudWatch dashboard were incremented. However, the job stayed in the waiting state and was never transferred. Transfers resumed only after the EC2 instance was terminated and a new instance was launched. Besides the S3 trigger, the hourly job was affected as well: it wasn't able to poll the source and destination buckets completely. This was the error in the Job Finder log:

2021-05-24T12:51:47.982+05:30 2021/05/24 07:21:47 Start running Finder job
2021-05-24T12:51:48.336+05:30 2021/05/24 07:21:48 Get Parameter Value of chinakeys from SSM
2021-05-24T12:51:48.782+05:30 2021/05/24 07:21:48 Queue DTHS3Stack-S3TransferQueue-7QIV4XZ7OKM1 has 0 not visible message(s) and 3 visable message(s)
2021-05-24T12:51:48.782+05:30 2021/05/24 07:21:48 Queue might not be empty or Unknown error... Please try again later

To Reproduce: The issue is intermittent; I have been unable to reproduce it.

Please complete the following information about the solution:

[ ] Version: v2.0.0
[ ] Region: [e.g. us-west-2]
[ ] Was the solution modified from the version published on this repository? No (only the cron job frequency was changed to daily from hourly)
[ ] If the answer to the previous question was yes, are the changes available on GitHub?
[ ] Have you checked your service quotas for the services this solution uses?
[ ] Were there any errors in the CloudWatch Logs? Yes, logs attached above.

YikaiHu commented 3 years ago

Hi @sushantbhat, could you check the EC2 JobWorker's logs in the CloudWatch log groups for the period when the SQS jobs were ignored by the workers? The log group name looks like 'xxxxxxWorkerLogGroupxxxxxxxx'. B, R

sushantbhat commented 3 years ago

@YikaiHu It seems that 'xxxxxxWorkerLogGroupxxxxxxxx' wasn't logging anything. There are no log streams from May 3 until the worker was restarted a couple of days ago. Even the log streams from before that are empty now; they don't contain even the normal 'No messages, sleep...' entries.

YikaiHu commented 3 years ago

@sushantbhat
For issue 1: "However, it stayed in waiting state and never got transferred."

Did you restart the EC2 Workers at any point before this issue occurred? It seems that the transfer CLI on your EC2 Workers shut down for some reason. If that happens, you have to terminate the 'dead' Workers, and the Auto Scaling group will launch new Workers to consume the jobs in SQS.
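As a rough illustration of the manual fix described above, the sketch below terminates one worker through its Auto Scaling group so that a replacement is launched. It is not part of the plugin. The function name `replace_dead_worker` is hypothetical, and the client is assumed to be a boto3 Auto Scaling client created by the caller.

```python
from typing import Any


def replace_dead_worker(autoscaling_client: Any, instance_id: str) -> None:
    """Terminate one worker and let its Auto Scaling group launch a replacement.

    `autoscaling_client` is assumed to be boto3.client("autoscaling");
    `instance_id` is the ID of the worker whose transfer CLI has stopped.
    """
    autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Keep the desired capacity unchanged so the group starts a new instance.
        ShouldDecrementDesiredCapacity=False,
    )
```

Terminating the instance directly in the EC2 console has the same effect, as long as the instance still belongs to the Auto Scaling group.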

For issue 2: "This was the error in the log of job finder .... Queue might not be empty or Unknown error... Please try again later"

This is expected behavior: the Finder is designed to check whether the SQS queue is empty before it compares the buckets.
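A minimal sketch of what such an emptiness check can look like is shown below. It is an illustration in Python, not the Finder's actual code. The function name `queue_is_empty` is hypothetical; the two attribute names are the ones returned by the SQS GetQueueAttributes API, which reports counts as strings.

```python
from typing import Mapping


def queue_is_empty(attributes: Mapping[str, str]) -> bool:
    """Return True only when SQS reports no visible and no in-flight messages.

    `attributes` is assumed to be the "Attributes" mapping returned by the
    SQS GetQueueAttributes API. A missing attribute is treated as "unknown",
    so the queue is conservatively reported as not empty.
    """
    visible = attributes.get("ApproximateNumberOfMessages")
    not_visible = attributes.get("ApproximateNumberOfMessagesNotVisible")
    if visible is None or not_visible is None:
        return False
    return int(visible) == 0 and int(not_visible) == 0
```

With the counts from the log above (3 visible, 0 not visible), this check returns False, which matches the Finder declining to run while jobs are still queued. With boto3, the mapping could be obtained from `get_queue_attributes(QueueUrl=..., AttributeNames=[...])["Attributes"]` on an SQS client.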

sushantbhat commented 3 years ago

No, I hadn't restarted the EC2 workers before. Only after this issue happened was the EC2 worker terminated, and, as you mentioned, the Auto Scaling group launched another instance that started working as expected. However, I would like to know what could have caused the JobWorker to fail, since we are planning to go to production with this solution, and whether anything can be done to prevent this from occurring in the future. Also, is there any notification mechanism that can be enabled for cases where the JobWorker fails and messages are stuck in the waiting state?

YikaiHu commented 3 years ago

@sushantbhat The most efficient way to find the root cause is to check the EC2 system log in the EC2 console (Actions -> Monitor and troubleshoot -> Get system log) and confirm whether the JobWorker's CLI is running.

The causes of JobWorker failure that we have seen so far:

The customer restarted the EC2 Worker instances.
The customer edited the EC2 user data, which can prevent the instance from downloading the CLI successfully.

In this version, we don't have a notification mechanism for Worker failures.
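Until the solution offers one, a possible workaround is a CloudWatch alarm on the queue itself. The sketch below is an assumption, not a feature of the plugin: it builds the arguments for the CloudWatch PutMetricAlarm API using the standard SQS metric `ApproximateAgeOfOldestMessage` (in seconds). The function name `stuck_queue_alarm`, the alarm name, the one-hour threshold, and the SNS topic are all hypothetical choices.

```python
from typing import Any, Dict


def stuck_queue_alarm(queue_name: str, sns_topic_arn: str,
                      max_age_seconds: int = 3600) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    The alarm fires when the oldest message in the queue has been waiting
    longer than `max_age_seconds`, which suggests that no worker is
    consuming jobs. The threshold and alarm name are illustrative only.
    """
    return {
        "AlarmName": f"{queue_name}-oldest-message-too-old",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # evaluate in 5-minute periods
        "EvaluationPeriods": 3,   # require 3 consecutive breaching periods
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue that reports no data should not raise the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**stuck_queue_alarm(queue_name, topic_arn))`, where `topic_arn` is an SNS topic you subscribe to for notifications. The threshold should be longer than the time a normal transfer job waits in the queue.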

awslabs / amazon-s3-data-replication-hub-plugin

Intermittent issue with s3 replication plugin #41