aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECR] [request]: CloudWatch metrics for ECR Replication lag #1537

Open Cylix opened 3 years ago

Cylix commented 3 years ago

Tell us about your request I'd like to have metrics for visualizing ECR replication lag, broken down by image and by destination region.

Which service(s) is this request for? ECR

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? As far as I know, there is currently no SLA regarding ECR replication time, and I didn't find anything in the documentation that clarifies what we should expect.

While most replications seem to be fairly fast (a few seconds), we've encountered occasional issues where replication took longer, especially when replicating from US to EU regions (e.g., us-west-2 replicated to eu-central-1/eu-west-1). However, we don't have a good way to visualize what a typical replication time is for our images, to see whether we encounter replication lag spikes (if so, how frequently? how long? in which regions/images?), or to configure alerting for replication lag issues.

We were excited when https://github.com/aws/containers-roadmap/issues/1193 shipped! While it helps us hold off deployments in some regions until the ECR API tells us the image has been replicated, it does not do much to improve our ability to monitor replication lag across images/regions.

Are you currently working around this issue? There does not seem to be a good workaround for us to monitor things.

I guess one way would be to call aws ecr describe-images for each region, since this API returns imagePushedAt (where imagePushedAt seems to contain the time when the image was replicated). We could call this API once aws ecr describe-image-replication-status tells us that an image has been replicated after pushing it. We could then determine the replication lag by comparing imagePushedAt between the source and destination regions and publish that somewhere (Datadog?).
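The workaround described above could be sketched roughly as follows. This is only an illustration, not something confirmed by AWS: it assumes, as observed above, that imagePushedAt in a destination region reflects when replication finished. The function names are made up for this example, and the clients are expected to be boto3 ECR clients (or anything exposing the same describe_images call).

```python
def pushed_at(ecr_client, repository, digest):
    """Return imagePushedAt for one image, as reported by one region's ECR API."""
    response = ecr_client.describe_images(
        repositoryName=repository,
        imageIds=[{"imageDigest": digest}],
    )
    # boto3 returns imagePushedAt as a timezone-aware datetime.
    return response["imageDetails"][0]["imagePushedAt"]


def replication_lag_seconds(source_client, destination_client, repository, digest):
    """Estimate replication lag by comparing imagePushedAt across two regions.

    Assumption (an observation, not documented behavior): in the destination
    region, imagePushedAt is the time the replicated copy became available.
    """
    source_time = pushed_at(source_client, repository, digest)
    destination_time = pushed_at(destination_client, repository, digest)
    return (destination_time - source_time).total_seconds()
```

With boto3, the two clients would be created with something like boto3.client("ecr", region_name="us-west-2") and boto3.client("ecr", region_name="eu-central-1"), and the function would be called once per destination region after replication is reported complete.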

Additional context N/A

Attachments N/A

maishsk commented 3 years ago

@Cylix Thanks for the feedback and for submitting this issue. I can understand how this can bring value.

I would like to dive a little deeper into the details.

Assuming there was such a metric, what would you like to see here? How long it took to replicate from region 1 to region 2? What would the measurement be? Seconds? Milliseconds?

If you could create an alert on such a metric, for example when the replication time exceeds your threshold of 1000 seconds, what actions would you take based on that alarm?

Cylix commented 3 years ago

Hey @maishsk, thanks for getting back to me.

Assuming there was such a metric, what would you like to see here? How long it took to replicate from region 1 to region 2?

Yes, we would like to see how long it is taking to replicate from one region to another. For example, if we are pushing our images to region-1 and are replicating to region-2, region-3, and region-4, we would like to see how long it is taking to replicate to each of these replication regions (region-1 to region-2, region-1 to region-3, and region-1 to region-4).

What would the measurement be? seconds? milliseconds?

Seconds sounds fine, since from our observations replication usually takes a couple of seconds.
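To make the requested shape concrete, here is a minimal sketch of how a per-region-pair lag value in seconds could be expressed as a custom CloudWatch datum. No such ECR metric exists today: the metric name, dimension names, and namespace below are invented for illustration, and only the put_metric_data call shape comes from CloudWatch itself.

```python
def replication_lag_datum(source_region, destination_region, repository, lag_seconds):
    """Build one MetricData entry for CloudWatch put_metric_data.

    The metric and dimension names are hypothetical, chosen for this example.
    """
    return {
        "MetricName": "ReplicationLag",
        "Dimensions": [
            {"Name": "SourceRegion", "Value": source_region},
            {"Name": "DestinationRegion", "Value": destination_region},
            {"Name": "RepositoryName", "Value": repository},
        ],
        "Value": float(lag_seconds),
        "Unit": "Seconds",
    }


def publish_lag(cloudwatch_client, datum, namespace="Custom/ECRReplication"):
    """Publish one datum under a custom namespace (the namespace is hypothetical)."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=[datum])
```

One datum per destination region (region-1 to region-2, region-1 to region-3, and so on) would allow graphing and alarming on each region pair separately.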

If you could create an alert on such a metric, for example when the replication time exceeds your threshold of 1000 seconds, what actions would you take based on that alarm?

Our main use case for replication is to reduce cross-region data transfer. When we deploy services in a region, we currently pull the service Docker images from the ECR repository in that same region.

Thus, our deployment logic first waits for an image to be replicated to that region before deploying it.

As you can see, there is a tradeoff between deployment time and deployment cost (cross-region ECR data transfer charges). However, if ECR is experiencing increased replication lag, it would make sense for us to pull the image from whatever region we pushed to, at extra cost, rather than waiting for replication to complete.
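The tradeoff above could look something like the following in deployment logic. This is a rough sketch, not our actual implementation: the function name, the timeout, and the polling interval are assumptions for illustration, and the client is expected to be a boto3 ECR client for the region we pushed to (describe_image_replication_status is the API that shipped with issue 1193).

```python
import time


def choose_pull_region(source_client, repository, digest, source_region,
                       target_region, max_wait_seconds=60.0, poll_seconds=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Wait for replication up to a deadline, then fall back to the source region.

    Returns target_region once replication to it is COMPLETE; returns
    source_region if replication FAILED or the deadline passes first.
    """
    deadline = clock() + max_wait_seconds
    while True:
        response = source_client.describe_image_replication_status(
            repositoryName=repository,
            imageId={"imageDigest": digest},
        )
        for entry in response.get("replicationStatuses", []):
            if entry.get("region") != target_region:
                continue
            if entry.get("status") == "COMPLETE":
                return target_region
            if entry.get("status") == "FAILED":
                return source_region  # no point waiting; pay for the transfer
        if clock() >= deadline:
            return source_region  # replication is lagging; pull cross-region
        sleep(poll_seconds)
```

With a replication lag metric, max_wait_seconds would not need to be a fixed guess: an alarm on elevated lag could switch deployments to the source region up front instead of waiting out the deadline on every deploy.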