Open Cylix opened 3 years ago
@Cylix Thanks for the feedback and for submitting this issue. I can understand how this would bring value.
I would like to dive a little deeper into the details.
Assuming there was such a metric, what would you like to see here? How long it took to replicate from region 1 to region 2? What would the measurement be? seconds? milliseconds?
If you could create an alert on such a metric, for example when the replication time exceeds your threshold of 1000 seconds, what actions would you take based on that alarm?
Hey @maishsk, thanks for getting back to me.
Assuming there was such a metric, what would you like to see here? How long it took to replicate from region 1 to region 2?
Yes, we would like to see how long it is taking to replicate from one region to another. For example, if we are pushing our images to region-1 and are replicating to region-2, region-3, and region-4, we would like to see how long it is taking to replicate to each of these replication regions (region-1 to region-2, region-1 to region-3, and region-1 to region-4).
What would the measurement be? seconds? milliseconds?
Seconds sounds fine since replication is taking a couple of seconds from our observations.
If you could create an alert on such a metric, for example when the replication time exceeds your threshold of 1000 seconds, what actions would you take based on that alarm?
Our main use case for replication is to reduce cross-region data transfer. When we deploy services in a region, we currently pull the service Docker images from the ECR repository in that region.
Thus, our deployment logic will first wait on an image to be replicated in that region before deploying it.
As you can see, there is a tradeoff between deployment time and deployment cost (cross-region ECR data transfer charges). However, if ECR is experiencing increased replication lag, it would make sense for us to download the image from whatever region we pushed to, at extra cost, rather than waiting on replication to complete.
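The wait-or-fall-back logic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `choose_pull_region` and the `is_replicated` callback are hypothetical names, and in practice the callback would presumably wrap a check such as `aws ecr describe-image-replication-status`. The 60-second wait budget is an arbitrary placeholder, not a recommended value.

```python
import time
from typing import Callable


def choose_pull_region(
    is_replicated: Callable[[str], bool],
    source_region: str,
    target_region: str,
    max_wait_s: float = 60.0,       # assumed wait budget, tune to taste
    poll_interval_s: float = 2.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the region to pull the image from.

    Polls the hypothetical is_replicated(target_region) callback until it
    reports True or the wait budget is spent. If the budget runs out, fall
    back to the source region and accept the cross-region transfer cost.
    """
    deadline = clock() + max_wait_s
    while True:
        if is_replicated(target_region):
            return target_region
        if clock() >= deadline:
            return source_region
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the decision logic testable without real waiting. With a replication lag metric available, `max_wait_s` could be derived from observed lag instead of being hard-coded.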
Tell us about your request: I'd like to have metrics for visualizing ECR replication lag for different images in different regions.
Which service(s) is this request for? ECR
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? As far as I know, there is currently no SLA regarding ECR replication time, and I didn't find anything in the documentation that clarifies what we should expect.
While most replications seem to be fairly fast (a few seconds), we've encountered occasional issues where replication took longer, especially when replicating from US to EU regions (e.g. us-west-2 replicated to eu-central-1/eu-west-1). However, we don't have a good way to visualize what a typical replication time is for our images, to see whether we encounter replication lag spikes (if so, how frequently? for how long? in which regions/images?), or even to configure alerting for replication lag issues.
We were excited when https://github.com/aws/containers-roadmap/issues/1193 shipped! While it helps us hold off deployments in some regions until the ECR API tells us the image has been replicated, it does not do much to improve our ability to monitor replication lag across images/regions.
Are you currently working around this issue? There does not seem to be a good workaround for us to monitor things.
I guess one way would be to call aws ecr describe-images for each region, since this API returns imagePushedAt (where imagePushedAt seems to contain the time when the image was replicated). We could call this API when aws ecr describe-image-replication-status lets us know that an image was replicated after pushing it. We could then determine the replication lag by comparing the imagePushedAt between the source & destination regions and publish that somewhere (Datadog?).
Additional context N/A
Attachments N/A
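The workaround described above (comparing imagePushedAt between the source and destination regions) might look roughly like the sketch below. It assumes boto3 is available and that imagePushedAt in a destination region reflects when the image was replicated there, which is only an observation and not something the documentation confirms. The names `fetch_pushed_at` and `replication_lags` are hypothetical helpers, not part of any AWS SDK.

```python
from datetime import datetime
from typing import Dict


def fetch_pushed_at(region: str, repository: str, tag: str) -> datetime:
    """Read imagePushedAt for one image tag in one region via ECR DescribeImages."""
    import boto3  # imported lazily so the pure helper below works without boto3

    client = boto3.client("ecr", region_name=region)
    resp = client.describe_images(
        repositoryName=repository,
        imageIds=[{"imageTag": tag}],
    )
    return resp["imageDetails"][0]["imagePushedAt"]


def replication_lags(
    pushed_at: Dict[str, datetime], source_region: str
) -> Dict[str, float]:
    """Return the lag in seconds for each destination region.

    Assumes (unconfirmed) that a destination's imagePushedAt is the time
    the replica became available in that region.
    """
    source_time = pushed_at[source_region]
    return {
        region: (ts - source_time).total_seconds()
        for region, ts in pushed_at.items()
        if region != source_region
    }
```

A caller would collect `fetch_pushed_at` for the source and each destination region once replication is reported complete, pass the resulting mapping to `replication_lags`, and publish the per-region values to whatever monitoring system is in use.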