Azure / azure-cli

Azure Command-Line Interface
MIT License

race condition in 'az acr import' can lead to 'manifest unknown' error in target registry #29974

Open HenryvanderVegte opened 1 week ago

HenryvanderVegte commented 1 week ago

Describe the bug

When running az acr import like

az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force

to copy the image by tag (e.g. 'latest') from sourceacr to targetacr, there is a race condition when the manifest for the tag in the source registry changes while the az acr import command is in progress.

In that case, the 'az acr import' command completes without any errors. However, docker pull fails with

PS C:\Users> docker pull targetacr.azurecr.io/myimage:latest

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview targetacr.azurecr.io/myimage:latest
Error response from daemon: manifest for targetacr.azurecr.io/myimage:latest not found: manifest unknown: manifest sha256:02f3*** is not found

Looking into the ACR, I can see the tag and its digest (screenshot 1), but I receive a 404 NotFound error when trying to fetch the manifest (screenshot 2).

I believe this is the same issue that was described in https://github.com/Azure/azure-cli/issues/21944.

As described in https://github.com/Azure/azure-cli/issues/21944, this is very dangerous if the ACR is used by a Kubernetes cluster, since it causes pod startup failures with ImagePullBackOff errors.
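Until the import itself is fixed, a cluster-facing pipeline could refuse to proceed when the tag is listed but its manifest cannot be fetched. The following is only a sketch: it assumes `docker login` has already been done against the target registry and that `docker manifest inspect` fails for a tag in this broken state the same way `docker pull` does. The function names `manifest_available` and `require_manifest` are hypothetical.

```shell
# Return 0 only if the registry can actually serve the manifest for a reference.
# $1 = full image reference, e.g. targetacr.azurecr.io/myimage:latest
manifest_available() {
    docker manifest inspect "$1" >/dev/null 2>&1
}

# Gate a rollout on the check: print a message and return 1 if the manifest is missing.
require_manifest() {
    if manifest_available "$1"; then
        echo "ok: $1"
    else
        echo "manifest not retrievable: $1" >&2
        return 1
    fi
}
```

For example, `require_manifest targetacr.azurecr.io/myimage:latest || exit 1` could run right after the import and before any deployment step.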

Related command

Here's a timeline of all commands that ran and put the ACR into a bad state:

1) myimage:142506623 with digest 02f3... pushed to source acr and gets tagged with latest

2024-09-24T09:38:19.0032592Z docker push ***/myimage:142506623
2024-09-24T09:38:20.7350441Z az acr import --name sourceacr --source sourceacr.azurecr.io/myimage:142506623 --image sourceacr.azurecr.io/myimage:latest --force --no-wait

2) az acr import to target registry starts

2024-09-24T09:40:11.4813456Z az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force

3) myimage:142506638 with digest 3107... pushed to source acr and tagged with latest

2024-09-24T09:40:39.2704699Z docker push ***/myimage:142506638 
2024-09-24T09:40:40.7808698Z az acr import --name sourceacr --source sourceacr.azurecr.io/myimage:142506638 --image sourceacr.azurecr.io/myimage:latest --force --no-wait

4) az acr import to target registry completes

2024-09-24T09:41:42.3197203Z INFO: ===> Completed in 90.84s: [az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force]

The az acr import in 4) completes without any errors, but from that time on the target registry is in a bad state.
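The window between steps 2) and 4) can at least be detected from the client side by reading the source tag's digest before and after the import. This is a minimal sketch and not a fix: it assumes the caller is logged in with read access to the source registry and that `az acr repository show --image` reports the digest the tag currently points to. The function names `tag_digest` and `import_and_check` are hypothetical.

```shell
# Print the digest a tag currently points to in a registry.
# $1 = registry name (e.g. sourceacr), $2 = repository:tag (e.g. myimage:latest)
tag_digest() {
    az acr repository show --name "$1" --image "$2" --query digest --output tsv
}

# Import a tag from source to target and report whether the source tag moved meanwhile.
# $1 = source registry, $2 = target registry, $3 = repository:tag
# Returns 0 if the tag was stable, 2 if it moved, 1 if any az call failed.
import_and_check() {
    before=$(tag_digest "$1" "$3") || return 1
    az acr import --name "$2" --source "$1.azurecr.io/$3" --image "$3" --force || return 1
    after=$(tag_digest "$1" "$3") || return 1
    if [ "$before" != "$after" ]; then
        echo "source tag $3 moved during import: $before -> $after" >&2
        return 2
    fi
}
```

A return code of 2 would correspond to the timeline above, where 'latest' moved from digest 02f3... to 3107... while the import was running.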

It probably makes no difference, but we use a pull token to connect to the source registry when transferring the image, like this:

az acr import --name targetacr --source sourceacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

Errors

docker pull on target acr fails with:

PS C:\Users> docker pull targetacr.azurecr.io/myimage:latest

What's next:
    View a summary of image vulnerabilities and recommendations → docker scout quickview targetacr.azurecr.io/myimage:latest
Error response from daemon: manifest for targetacr.azurecr.io/myimage:latest not found: manifest unknown: manifest sha256:02f3*** is not found

az acr import to copy the image from target acr to a different acr fails with:

az acr import --name testacr --source targetacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

(InvalidParameters) Operation registries-*** failed. Resource /subscriptions/***/resourceGroups/***/providers/Microsoft.ContainerRegistry/registries/testacr Invalid message NotFound Not Found {"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest sha256:02f3*** is not found","detail":{"Name":"myimage","Revision":"sha256:02f3***"}}]}

Code: InvalidParameters
Message: Operation registries-*** failed. Resource /subscriptions/***/resourceGroups/***/providers/Microsoft.ContainerRegistry/registries/testacr Invalid message NotFound Not Found {"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest sha256:02f3*** is not found","detail":{"Name":"myimage","Revision":"sha256:02f3***"}}]}

Issue script & Debug output

Captured debug output via

az acr import --debug --name testacr --source targetacr.azurecr.io/myimage:latest --image myimage:latest --force --password *** --username myPullToken

but I'm afraid that it might contain sensitive information. I will provide it if required.

Expected behavior

az acr import should leave the registry in a consistent state. It should point the tag at either the old or the new image, and keep the corresponding manifest.

If the image associated with 'latest' changes while the command is running, it should do one of the following:

1) fail the az acr import command and not update anything
2) update the acr with the image that was associated with 'latest' when the import started
3) update the acr with the new 'latest' image
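Option 2) can be approximated on the client by resolving the tag to a digest once and importing by that digest, so that a later retag in the source no longer affects what is copied. This is a hedged sketch. It relies on `az acr import --source` accepting the documented `repository@digest` form and on `az acr repository show --image` returning the tag's digest. Whether it avoids the server-side problem reported here is untested. The function names `pinned_source` and `import_pinned` are hypothetical.

```shell
# Build a digest-pinned source reference for az acr import.
# $1 = source registry name, $2 = repository, $3 = digest (sha256:...)
pinned_source() {
    printf '%s.azurecr.io/%s@%s\n' "$1" "$2" "$3"
}

# Resolve the tag once, then import exactly that manifest under the same tag.
# $1 = source registry, $2 = target registry, $3 = repository, $4 = tag
import_pinned() {
    digest=$(az acr repository show --name "$1" --image "$3:$4" \
        --query digest --output tsv) || return 1
    [ -n "$digest" ] || return 1
    az acr import --name "$2" \
        --source "$(pinned_source "$1" "$3" "$digest")" \
        --image "$3:$4" --force
}
```

Calling `import_pinned sourceacr targetacr myimage latest` would then copy whatever 'latest' pointed to at the moment the digest was read.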

Environment Summary

azure-cli 2.245.5

Additional context

No response

azure-client-tools-bot-prd[bot] commented 1 week ago

Hi @HenryvanderVegte,

2.245.5 is not the latest Azure CLI (2.64.0).

If you haven't already attempted to do so, please upgrade to the latest Azure CLI version by following https://learn.microsoft.com/en-us/cli/azure/update-azure-cli.

yonzhan commented 1 week ago

Thank you for opening this issue, we will look into it.

github-actions[bot] commented 1 week ago

Here are some similar issues that might help you. Please check if they can solve your problem.

microsoft-github-policy-service[bot] commented 1 week ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @toddysm, @luisdlp, @northtyphoon.

kichalla commented 3 days ago

cc @nathana1