alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Unable to deploy Azure Container Instances from Docker #1984

Status: Closed (jemrobinson closed this issue 2 weeks ago)

jemrobinson commented 3 weeks ago


:no_entry_sign: Describe the problem

SRE deployment fails when pulling images from Docker Hub, with the error shown in the log messages below.

This is a known issue, tracked here: https://github.com/Azure/azure-cli/issues/29300. It is caused by Docker Hub rate limits, as described here: https://medium.com/@alaa.barqawi/docker-rate-limit-with-azure-container-instance-and-aks-4449cede66dd.

> Docker Hub’s rate limiting policy is designed to manage the load on their infrastructure and prevent abuse. Starting June 30, 2024, the following rate limits will be in effect:
>
>   • Anonymous users can pull up to 100 images per 6-hour period.
>   • Authenticated users (with a Docker account) can pull up to 200 images per 6-hour period.
>   • Users with a paid Docker license can pull up to 5,000 images per day.
>
> These limits apply to both individual users and automated processes, such as container orchestration systems like AKS and ACI. Exceeding these limits can result in your Kubernetes cluster being unable to pull new container images, leading to issues with application restarts, deployments, and scaling.

Guidance from Microsoft is here: https://techcommunity.microsoft.com/t5/apps-on-azure-blog/best-practices-for-using-azure-container-registry-and-docker-hub/ba-p/4068979

~It's currently unclear to me whether the "100 images per 6-hour period" limit applies per ACI, per subscription or for the whole of Azure?~ It's per IP address. Does Azure use a different IP address for each ACI? I doubt it.
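One way to see which limit applies to a given outgoing IP address is to read the rate-limit headers that Docker Hub returns. The following is a minimal sketch, assuming the `ratelimitpreview/test` repository and the `ratelimit-limit` / `ratelimit-remaining` headers described in Docker's documentation are still available. The function names are hypothetical and not part of data-safe-haven.

```python
import json
import urllib.request

# Endpoints taken from Docker's rate-limit documentation (assumed still valid).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"


def parse_ratelimit(value):
    """Parse a header such as '100;w=21600' into (pulls, window_seconds)."""
    count, _, params = value.partition(";")
    window = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def check_docker_hub_limit():
    """Return the anonymous limit and remaining pulls seen from this IP address."""
    # Request an anonymous pull token for the test repository.
    with urllib.request.urlopen(TOKEN_URL) as response:
        token = json.load(response)["token"]
    # A HEAD request exposes the headers without downloading the manifest.
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        headers = response.headers
        return {
            name: parse_ratelimit(headers[name])
            for name in ("ratelimit-limit", "ratelimit-remaining")
            if headers.get(name) is not None
        }
```

Calling `check_docker_hub_limit()` from inside a container running in ACI should, under these assumptions, show how much of the anonymous allowance is left for whatever IP address Azure uses for that pull.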

Here are instructions for how to authenticate from ACI

:deciduous_tree: Log messages

Relevant log messages

```
azure-native:containerinstance:ContainerGroup (sre_dns_server_container_group): error: autorest/azure: Service returned an error. Status= Code="RegistryErrorResponse" Message="An error response is received from the docker registry 'index.docker.io'. Please retry later."
```


JimMadge commented 3 weeks ago

😱

jemrobinson commented 3 weeks ago

Some possible solutions:

  1. Azure Container Registry + artifact cache
    • copy images from Docker Hub into Azure, then pull them from Azure. N.B. this doesn't avoid rate limiting issues
  2. Use Docker Hub credentials to request images
    • limit is 200 pulls every 6 hours for an authenticated account (per the limits quoted above), which shouldn't be a problem
    • not totally sure how to integrate this into the ACI workflow
  3. Look for other sources of images, e.g. quay.io or ghcr.io
    • it's likely that not all images will exist away from Docker Hub
  4. Create our own GitHub repo for each Docker image that simply pulls and republishes
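For option 2, a rough sketch of how Docker Hub credentials might be attached to a container group. It only builds the `imageRegistryCredentials` entry in the shape used by the Azure container group resource (`server`, `username`, `password`). The helper name and the environment variable names `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` are hypothetical, and I'm assuming the same three fields map onto the `image_registry_credentials` argument of the Pulumi `ContainerGroup` resource.

```python
import os


def docker_hub_credentials(username, token, server="index.docker.io"):
    """Build one imageRegistryCredentials entry for a container group.

    'index.docker.io' is the registry named in the error message above.
    """
    if not username or not token:
        raise ValueError("Docker Hub username and access token are both required")
    return {"server": server, "username": username, "password": token}


def credentials_from_environment():
    """Read credentials from hypothetical environment variables."""
    return docker_hub_credentials(
        os.environ.get("DOCKERHUB_USERNAME", ""),
        os.environ.get("DOCKERHUB_TOKEN", ""),
    )
```

The returned dictionary would be passed as a single-element list when the container group is defined, so that image pulls are authenticated instead of anonymous. In a real deployment the token should come from a secret store and not from plain configuration.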

JimMadge commented 3 weeks ago

Long term, I like the idea of holding images 'closer' to our infrastructure. Less wasteful for us to fetch each image once and keep it in the SHM (probably?).

jemrobinson commented 3 weeks ago

Agreed - slightly worried that we'd still need to use Docker Hub credentials to get the image into the SHM in the first place though.
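If images do end up being held in a registry closer to the infrastructure, the image references used by the container groups would need to point at it. Below is a small sketch of that rewrite, assuming the mirror keeps the Docker Hub repository path (including the implicit `library/` prefix for official images). The function name and the example registry host are hypothetical.

```python
DOCKER_HUB_HOSTS = ("docker.io", "index.docker.io")


def mirror_image_ref(image, mirror_host):
    """Rewrite a Docker Hub image reference to point at a mirror registry.

    References that already name a non-Docker-Hub registry are returned unchanged.
    """
    first, sep, rest = image.partition("/")
    # By Docker convention the first path component is a registry host
    # only if it contains '.' or ':' or is exactly 'localhost'.
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        if first not in DOCKER_HUB_HOSTS:
            return image
        path = rest
    else:
        path = image
    # Official images live under the implicit 'library/' namespace.
    if "/" not in path:
        path = f"library/{path}"
    return f"{mirror_host}/{path}"
```

For example, `mirror_image_ref("nginx:1.27", "example.azurecr.io")` gives `example.azurecr.io/library/nginx:1.27`, while a `ghcr.io/...` reference is left alone.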