Azure / acr

Azure Container Registry samples, troubleshooting tips and references
https://aka.ms/acr

ACR streaming: failed to open remote file as tar file error #763

Closed alexp-openai closed 4 months ago

alexp-openai commented 5 months ago

Describe the bug I'm evaluating the ACR streaming preview and hit a problem where my container cannot start when streaming is enabled.

kubelet repeatedly logs errors like this:

Normal   Pulling    8s (x2 over 41s)  kubelet            Pulling image "xxx.azurecr.io/test-alexp-jupyter-datascience-notebook:latest"
Warning  Failed     8s                kubelet            Error: failed to create containerd container: failed to attach and mount for snapshot 218: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_218, failed:failed to open remote file as tar file /https://xxx.azurecr.io/v2/test-alexp-jupyter-datascience-notebook/blobs/sha256:f16ce562223807a933f8040b1c3ce2a617377e7f160826980d7f8c6fcc84bb2f: No such file or directory: unknown

It's interesting that there is a slash in front of "https" in the blob URL.
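
For anyone triaging a similar message, here is a minimal Python sketch that pulls the registry host, repository, and blob digest out of the error text, tolerating the leading slash. The names `BlobRef` and `parse_blob_ref` are hypothetical helpers made up for illustration, and the pattern is inferred only from the single error line above, so other message shapes may not match.

```python
import re
from typing import NamedTuple, Optional


class BlobRef(NamedTuple):
    """Pieces of a registry blob URL (hypothetical helper type)."""
    registry: str
    repository: str
    digest: str


# Assumption: the URL looks like the one in the kubelet error above,
# i.e. an optional leading "/" followed by
# https://<registry>/v2/<repository>/blobs/sha256:<64 hex chars>.
_BLOB_URL = re.compile(
    r"/?https://(?P<registry>[^/\s]+)"
    r"/v2/(?P<repository>.+?)"
    r"/blobs/(?P<digest>sha256:[0-9a-f]{64})"
)


def parse_blob_ref(message: str) -> Optional[BlobRef]:
    """Return the first blob reference found in message, or None."""
    match = _BLOB_URL.search(message)
    if match is None:
        return None
    return BlobRef(match["registry"], match["repository"], match["digest"])
```

With the three fields in hand, the blob's existence could then be checked against the registry's `/v2/<repository>/blobs/<digest>` endpoint using an authenticated request, which may help separate a missing blob from a node-side mount problem.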

To Reproduce Steps to reproduce the behavior:

  1. I followed the instructions from https://medium.com/@rammadasu5/how-to-enable-artifact-streaming-on-your-aks-node-pools-to-stream-artifacts-from-acr-and-reduce-64bc22ba9788 , using an existing ACR registry and AKS node pool.
  2. Created a new deployment from an ACR copy of the public jupyter-datascience-notebook image, with streaming enabled on that copy.
  3. The container fails to start with CreateContainerError and the error message above.

Expected behavior Container should start

Any relevant environment information

kubectl version                  
Client Version: v1.30.1
Server Version: v1.28.9

AKS cluster was deployed a few days ago and is on the latest version for the control plane and node pool.

AKS node info:

System Info:
  Machine ID:                 229240f927f1457daabe410ed4f53257
  System UUID:                3f61de23-34b4-4744-83a1-182c5ce28e9d
  Boot ID:                    bdc6a038-823b-4186-80d0-b44b37a0ec47
  Kernel Version:             5.15.0-1064-azure
  OS Image:                   Ubuntu 22.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.15-1
  Kubelet Version:            v1.28.9

If any information is a concern to post here, you can create a support ticket or send an email to acrsup@microsoft.com.

juliusl commented 5 months ago

Thanks for reporting this, I'm taking a look on our side --

> It's interesting that there is a slash in front of "http" for the docker image url.

Did some digging and that appears to be normal.

estebanreyl commented 5 months ago

Hello @alexp-openai, I tried to repro earlier today using both the latest jupyter/datascience-notebook from Docker Hub and the latest from quay.io, but was unable to get the same error. Looking at the ACR logs, I do notice that the image I converted is not identical to the one you converted, so there may be something further there. Would you be able to provide a more specific image (using a fixed tag or digest) that you know fails, so we can verify with it? We are committed to making sure the service is reliable for all workloads.

As a side note, we are in the process of rolling out a new version of the underlying service responsible for conversion, so I would suggest trying again in the next couple of days as that rolls out, to see whether any of the fixes there affect your scenario. Beyond that, I will continue trying to reproduce and understand what may have gone wrong.

alexp-openai commented 5 months ago

So, this was the first public image that I tried. I just pulled it from the public Docker Hub last week. I can try another one a bit later. It's also possible that something is wrong with my cluster setup. This AKS cluster was set up last week as well, so the versions should be recent.

If you have some suggestions on how to troubleshoot it further, feel free to share.

juliusl commented 5 months ago

@alexp-openai If you debug the node with kubectl debug nodes/<node-name> -it --image bash (you'll need to run chroot /host once it connects), there are some logs you can collect:

1) overlaybd logs -

estebanreyl commented 4 months ago

Hi @alexp-openai just wanted to check in. Have you continued to encounter the issue? Is there any more info you would like us to take a look at? It might be best to follow up with a support ticket.

estebanreyl commented 4 months ago

Closing since the issue has been open for three weeks with no further input. If we can provide further assistance, please let us know through a support ticket: https://azure.microsoft.com/en-us/support/create-ticket/