AWS ECR: Could not set registry endpoint credentials ... failed timeout after 10s

bcbrockway commented 11 months ago

Describe the bug

We run the Image Updater on EKS clusters, using IRSA to link it to an IAM role that grants it access to our ECR registry. In addition, we have an auth script configured to run an awscli command to grab a new token every 11 hours:

# configmap/argocd-image-updater-config
# ...
data:
  registries.conf: |
    registries:
    - api_url: https://000000000000.dkr.ecr.us-east-2.amazonaws.com
      credentials: ext:/scripts/ecr-login-us-east-2.sh
      credsexpire: 11h
      name: ECR
      prefix: 000000000000.dkr.ecr.us-east-2.amazonaws.com

# configmap/argocd-image-updater-authscripts
# ...
data:
  ecr-login-us-east-2.sh: |
    #!/bin/sh
    aws ecr --region 'us-east-2' get-authorization-token --cli-read-timeout 5 --cli-connect-timeout 5 --output text --query 'authorizationData[].authorizationToken' | base64 -d

This usually works on startup, and sometimes after credsexpire, but it also often fails with:

Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: /scripts/ecr-login-us-east-2.sh failed timeout after 10s

Sometimes this can take hours of retries to rectify and sometimes nothing short of killing the pod and starting a new one will fix it.

It's also odd that it seems to run this script once for each app in its update cycle (see logs below) rather than just once, given that we've configured the credentials at the registry level.
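As a possible stopgap for the once-per-app behaviour, a wrapper along the following lines could coalesce a burst of calls into a single AWS request. This is only a sketch under stated assumptions, not something the Image Updater provides: `FETCH_CMD`, `CACHE_FILE`, `MAX_AGE_MIN` and `LOCK_DIR` are names made up for illustration, it assumes `/tmp` is writable in the pod and that `find` supports `-mmin`, and it writes the token to disk, which may not be acceptable everywhere. The cache lifetime is kept deliberately short so that `credsexpire` still decides when a real refresh happens.

```shell
#!/bin/sh
# Sketch only: coalesce a burst of credential requests into one AWS call.
# FETCH_CMD, CACHE_FILE, MAX_AGE_MIN and LOCK_DIR are illustrative names,
# not Image Updater options.
FETCH_CMD="${FETCH_CMD:-/scripts/ecr-login-us-east-2.sh}"
CACHE_FILE="${CACHE_FILE:-/tmp/ecr-token-us-east-2}"
MAX_AGE_MIN="${MAX_AGE_MIN:-5}"   # short on purpose; only meant to cover one burst
LOCK_DIR="${CACHE_FILE}.lock"

umask 077  # the cache file holds a secret

cache_is_fresh() {
  # find prints the path only if the file changed less than MAX_AGE_MIN minutes ago
  [ -s "$CACHE_FILE" ] &&
    [ -n "$(find "$CACHE_FILE" -mmin "-$MAX_AGE_MIN" 2>/dev/null)" ]
}

get_token() {
  locked=0
  tries=0
  # mkdir is atomic, so it serves as a lock. Waiting is bounded (about 4s)
  # so that a stale lock can delay a caller but never block it forever.
  while [ "$tries" -lt 4 ]; do
    if cache_is_fresh; then cat "$CACHE_FILE"; return 0; fi
    if mkdir "$LOCK_DIR" 2>/dev/null; then locked=1; break; fi
    sleep 1
    tries=$((tries + 1))
  done
  status=0
  # Re-check: another caller may have refreshed the cache while we waited.
  if ! cache_is_fresh; then
    tmp="${CACHE_FILE}.$$"
    if $FETCH_CMD > "$tmp" && [ -s "$tmp" ]; then
      mv "$tmp" "$CACHE_FILE"   # rename so readers never see a partial token
    else
      rm -f "$tmp"
      status=1
    fi
  fi
  if [ "$locked" -eq 1 ]; then rmdir "$LOCK_DIR" 2>/dev/null; fi
  if [ "$status" -eq 0 ]; then cat "$CACHE_FILE"; fi
  return "$status"
}
```

To try it, the file would end with a call to `get_token` and be referenced from `credentials: ext:...` in place of the original script, with `FETCH_CMD` pointing at the original. If the lock holder is killed by the 10s timeout the lock directory is left behind, and later callers fall back to fetching on their own after the bounded wait.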

To Reproduce

Set up as above. Unfortunately, this is intermittent.

Expected behavior

The script runs correctly (once) and stores the new token for all apps to use.

Additional context

N/A

Version

0.12.0

Logs

2023-12-20T14:11:38+00:00   time="2023-12-20T14:11:38Z" level=info msg="Processing results: applications=4 images_considered=3 images_skipped=1 images_updated=0 errors=1"
2023-12-20T14:11:38+00:00   time="2023-12-20T14:11:38Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
2023-12-20T14:11:39+00:00   time="2023-12-20T14:11:39Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=0 errors=0"
2023-12-20T14:12:09+00:00   time="2023-12-20T14:12:09Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
2023-12-20T14:12:10+00:00   time="2023-12-20T14:12:10Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=0 errors=0"
2023-12-20T14:12:30+00:00   {"log":"time=\"2023-12-20T14:12:30Z\" level=info msg=\"Starting image update cycle, considering 3 annotated application(s) for update\"\n","stream":"stdout","time":"2023-12-20T14:12:30.44525427Z"}
2023-12-20T14:12:31+00:00   {"log":"time=\"2023-12-20T14:12:31Z\" level=info msg=\"Processing results: applications=3 images_considered=2 images_skipped=2 images_updated=0 errors=0\"\n","stream":"stdout","time":"2023-12-20T14:12:31.748061081Z"}
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg="Starting image update cycle, considering 19 annotated application(s) for update"
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=7dddb
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=33a72
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=2a859
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=e6515
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=dce93
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=d9146
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=508d5
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=68554
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=3c106
2023-12-20T14:12:41+00:00   time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=d7263
2023-12-20T14:12:51+00:00   time="2023-12-20T14:12:51Z" level=error msg="`/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" execID=7dddb
2023-12-20T14:12:51+00:00   time="2023-12-20T14:12:51Z" level=error msg="Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: `/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" alias=report-subscription-event-producer application=report-subscription-event-producer image_name=gitlab/mintel/core-services/report-subscription-event-producer image_tag=ebdfe4eccab090c0d5a60a3bd4aae4aa7b8c3ae2-test registry=000000000000.dkr.ecr.us-east-2.amazonaws.com
2023-12-20T14:12:51+00:00   time="2023-12-20T14:12:51Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=b2204
2023-12-20T14:12:51+00:00   time="2023-12-20T14:12:51Z" level=error msg="`/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" execID=2a859
2023-12-20T14:12:51+00:00   time="2023-12-20T14:12:51Z" level=error msg="Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: `/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" alias=ataccama-event-bridge application=ataccama-event-bridge image_name=gitlab/mintel/data-warehouse/agents/reference-data/ataccama-event-bridge image_tag="sha256:80c37d6719f3f2fd3e24a5264e2e1fbf1e37cf06a308f379db88ca55639ae498" registry=000000000000.dkr.ecr.us-east-2.amazonaws.com

PuChenTW commented 10 months ago

Setting --max-concurrency to 1 works for me, although I don't know exactly how this fixes the problem 😅 https://argocd-image-updater.readthedocs.io/en/stable/install/reference/#flags

extraArgs:
  - --max-concurrency
  - "1"
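One way to check whether concurrency is what pushes the script past the 10s limit might be to run it by hand several times in parallel (for example from a shell inside the pod) and compare the timings with a single run. The sketch below is an assumption-laden diagnostic, not part of the Image Updater: `CMD`, `N`, `time_one` and `time_parallel` are names invented here, and `N=10` simply mirrors the ten simultaneous `execID` lines in the logs above.

```shell
#!/bin/sh
# Diagnostic sketch: run a command N times in parallel and print how long each run took.
# CMD and N are illustrative names; adjust them to the script and concurrency under test.
CMD="${CMD:-/scripts/ecr-login-us-east-2.sh}"
N="${N:-10}"

time_one() {
  start=$(date +%s)
  $CMD >/dev/null 2>&1      # discard output so the token is never printed
  rc=$?
  end=$(date +%s)
  echo "rc=$rc seconds=$((end - start))"
}

time_parallel() {
  i=0
  while [ "$i" -lt "$N" ]; do
    time_one &              # each run happens in its own background subshell
    i=$((i + 1))
  done
  wait                      # block until every background run has reported
}
```

Calling `time_parallel` with `N=1` and then `N=10` would show whether parallel runs are noticeably slower than a single one; any run near or above 10 seconds would be a candidate for the timeout.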

bcbrockway commented 10 months ago

> Setting --max-concurrency to 1 works for me, although I don't know exactly how this fixes the problem 😅 https://argocd-image-updater.readthedocs.io/en/stable/install/reference/#flags
>
> extraArgs:
>   - --max-concurrency
>   - "1"

Some of our ArgoCD instances have a lot of apps so this would slow us down quite a bit :(

tareks commented 3 months ago

This still seems to happen even with --max-concurrency set to 1. Is this still happening to anyone else? Where is the 10s timeout being set and can it be extended?

Could it be that when one call fails, invalid token data gets cached for the lifetime of credsexpire?

bitgandtter commented 2 months ago

Same here; it connects at the start, then fails when it requires a refresh. Any solution?

aibazhang commented 2 months ago

We just made a new image updater for ECR by ourselves from scratch...

argoproj-labs / argocd-image-updater

AWS ECR: Could not set registry endpoint credentials ... failed timeout after 10s #657