awslabs / amazon-s3-data-replication-hub-plugin

The Amazon S3 Transfer Plugin for Data Transfer Hub (https://github.com/awslabs/data-transfer-hub). Transfer objects from S3 (in another partition), Alibaba Cloud OSS, Tencent COS, and Qiniu Kodo into Amazon S3.
Apache License 2.0

Keep getting 403 Error after access keys rotation #87

Open shikunwei opened 2 years ago

shikunwei commented 2 years ago

**To Reproduce**

  1. Follow the DEPLOYMENT_EN.md to install the S3 replication plugin and verify that it works fine.
  2. Create new access keys and delete the old access keys.
  3. Update the secret created in step 1 with the new AK/SK values.
  4. Make some changes to the source bucket, and you will notice that the replication stops working.
  5. Go to the log of the instance, and you will see an error log like this:
    
    2022/08/01 03:16:17 S3> Got an error uploading file - operation error S3: PutObject, https response error StatusCode: 403, RequestID: xxxxxxxx, HostID: xxxxxxxx, api error InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.
    2022/08/01 03:16:17 ----->Transferred 1 object xxxxxxxx/xxxxxxxx.json with status ERROR


  6. Terminate the active instance in the Auto Scaling group (ASG) and wait for the new instance to be ready. Replication then works again.

So the problem appears to be caused by outdated credentials cached on the instance.

**Expected behavior**
After a few failed attempts, the instance should pull the latest credentials from Secrets Manager instead of continuing to retry with the outdated credentials in its cache.
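
The expected behavior described above could look roughly like the following minimal sketch. It is written in Python for illustration only and is not the plugin's actual code. The class name `RefreshingCredentials`, the `fetch` hook (which would read the secret), and the threshold `max_failures` are all assumptions introduced here.

```python
from typing import Callable, Optional, Tuple

Credentials = Tuple[str, str]  # (access key ID, secret access key)


class RefreshingCredentials:
    """Caches credentials and drops the cache after repeated auth failures.

    Hypothetical helper: `fetch` stands in for whatever reads the AK/SK
    from the secret; `max_failures` is an assumed threshold.
    """

    def __init__(self, fetch: Callable[[], Credentials], max_failures: int = 3):
        self._fetch = fetch
        self._max_failures = max_failures
        self._cached: Optional[Credentials] = None
        self._failures = 0

    def get(self) -> Credentials:
        # Read from the secret only when nothing is cached.
        if self._cached is None:
            self._cached = self._fetch()
        return self._cached

    def report_success(self) -> None:
        # A successful request means the cached credentials are still valid.
        self._failures = 0

    def report_auth_failure(self) -> None:
        # Called when a request fails with a 403 such as InvalidAccessKeyId.
        self._failures += 1
        if self._failures >= self._max_failures:
            # Forget the cached value so the next get() re-reads the secret.
            self._cached = None
            self._failures = 0
```

With this shape, a worker would call `get()` before each transfer and `report_auth_failure()` on each 403, so a rotated key is picked up after at most `max_failures` failed attempts rather than only after the instance is replaced.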

**Please complete the following information about the solution:**
- [X] Version: 
(SO8002) - Data Transfer Hub - S3 Plugin - Template version v1.0.0
- [ ] Region: Any
- [ ] Was the solution modified from the version published on this repository? 
No
- [ ] If the answer to the previous question was yes, are the changes available on GitHub?
- [X] Have you checked your [service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) for the services this solution uses?
Yes, it's not relevant
- [X] Were there any errors in the CloudWatch Logs?
Yes. Please see above in the Reproduce section.

**Screenshots**
None

**Additional context**
None

shikunwei commented 2 years ago

We do IAM key rotation monthly for security reasons.

The brief steps of the rotation are:

  1. First IAM AK/SK is created and will be used for about one month.
  2. On the 30th day, we will create the second pair of AK/SK, and update the secret manager with the second AK/SK. At this point, both the first AK/SK and the second AK/SK are active and can be used.
  3. On the 37th day, we will make the first AK/SK inactive, so only the second AK/SK is active and can be used after that.
  4. On the 44th day, we will delete the first AK/SK.

The amazon-s3-data-replication-hub-plugin works fine during steps 1 and 2, that is, for the first 37 days, while the first AK/SK is still active.

But on the 38th day, it stops working and throws this error: "2022/08/01 03:16:17 S3> Got an error uploading file - operation error S3: PutObject, https response error StatusCode: 403, RequestID: xxxxxxxx, HostID: xxxxxxxx, api error InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records."

It seems that the worker node caches the AK/SK it reads from Secrets Manager on day 1 and never updates the cache after that. Although we update the secret with the second AK/SK on day 30, the cached AK/SK is not refreshed. From day 38, once the first key is inactive, the plugin starts getting 403 errors and keeps retrying with the outdated AK/SK in its cache.
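
The timeline above can be modelled in a few lines to show why a worker that never refreshes fails from day 38 while one that re-reads the secret would not. This is only an illustrative sketch: the key IDs are placeholders, and it assumes the first key is deactivated at the end of day 37, which matches the observed failure on day 38.

```python
FIRST_KEY, SECOND_KEY = "AK1", "AK2"  # placeholder key IDs


def key_in_secret(day: int) -> str:
    """Key stored in the secret: the second key from day 30 onward."""
    return SECOND_KEY if day >= 30 else FIRST_KEY


def usable_keys(day: int) -> set:
    """Keys that IAM would accept on the given day."""
    keys = set()
    if day <= 37:  # assumption: first key is deactivated at the end of day 37
        keys.add(FIRST_KEY)
    if day >= 30:  # second key is created on day 30
        keys.add(SECOND_KEY)
    return keys


def transfer_succeeds(day: int, refresh: bool) -> bool:
    """A non-refreshing worker keeps using the key it read on day 1."""
    key = key_in_secret(day) if refresh else key_in_secret(1)
    return key in usable_keys(day)
```

In this model `transfer_succeeds(37, refresh=False)` holds, `transfer_succeeds(38, refresh=False)` does not, and a refreshing worker succeeds on every day.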

If the worker node had a mechanism to refresh its cached credentials, the issue would be resolved.

Could you please take a look at this issue?

YikaiHu commented 2 years ago

Hi @shikunwei, sorry for the late reply.

Thanks for reporting this issue to us. Data Transfer Hub doesn't currently support automatically rotated access keys, and we are trying to find a way to support this scenario.

Here we provide a workaround:

You can use an EventBridge rule and a Lambda function to terminate all active worker instances when your source access key is rotated.
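
A minimal sketch of the Lambda side of this workaround might look like the following, assuming Python and boto3. It is an illustration, not code shipped with the solution. The environment variable `WORKER_ASG_NAME` and the function names are hypothetical, and the EventBridge rule that invokes the function when the secret changes is not shown.

```python
import os


def terminate_workers(asg_client, asg_name):
    """Terminate every in-service instance of the given Auto Scaling group.

    The group launches replacements, and the new instances read the
    current AK/SK from the secret when they start.
    """
    resp = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    terminated = []
    for group in resp["AutoScalingGroups"]:
        for instance in group["Instances"]:
            if instance["LifecycleState"] != "InService":
                continue  # skip instances that are pending or already terminating
            asg_client.terminate_instance_in_auto_scaling_group(
                InstanceId=instance["InstanceId"],
                # Keep desired capacity so a replacement instance is launched.
                ShouldDecrementDesiredCapacity=False,
            )
            terminated.append(instance["InstanceId"])
    return terminated


def handler(event, context):
    # Lambda entry point; boto3 is imported here so the helper above can be
    # exercised with a fake client outside Lambda.
    import boto3

    asg_name = os.environ["WORKER_ASG_NAME"]  # hypothetical variable name
    ids = terminate_workers(boto3.client("autoscaling"), asg_name)
    return {"terminated": ids}
```

Terminating all workers at once interrupts in-flight transfers, so it may be preferable to trigger this only after the new key has been written to the secret and before the old key is deactivated.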