Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] Artifact Streaming #3928

Open allyford opened 9 months ago

allyford commented 9 months ago

Previously known as Teleport, Artifact Streaming will enable customers to store, manage, and stream container images from a single registry to Azure Kubernetes Service (AKS) clusters in multiple regions. With Artifact Streaming, customers can deploy containerized applications to multiple regions without having to create multiple registries or enable geo-replication.

Azure Container Registry (ACR) and Azure Kubernetes Service (AKS) will soon support artifact streaming. Artifact streaming for AKS gives customers the ability to accelerate containerized workloads in the cloud by reducing overall startup time by almost 15% when connected to ACR.

Artifact streaming will let customers scale resources on AKS seamlessly, without waiting through long pull times for each Kubernetes pod. Linux amd64 container images are supported initially, and we plan to support Windows and arm64 container images in the future.

The ACR and AKS team would also like to give a huge thanks to Alibaba for their contributions to the containerd Overlaybd project.

We can't wait to hear what our customers think and look forward to feedback on further improving this feature.

ACR GitHub issue: #1785

maxiedaniels commented 8 months ago

@allyford is this available for preview yet? I just saw this now, and I've been trying to get invited to the Teleport preview for ages.

allyford commented 8 months ago

Preview coming soon! The CLI experience will be available next week. I'll continue to update this GitHub issue.
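
As a rough sketch of what the CLI flow looks like once the preview lands (the `az acr artifact-streaming` command group, its flag names, and the registry/pool names below are assumptions, not confirmed syntax; only `--enable-artifact-streaming` is referenced elsewhere in this thread):

    # Generate a streaming artifact for an existing image in ACR
    # (command group and --image flag are assumed preview syntax).
    az acr artifact-streaming create \
      --name myregistry \
      --image jupyter/datascience-notebook:preview

    # Add an AKS node pool with artifact streaming enabled
    # (--enable-artifact-streaming is the flag referenced later in this thread).
    az aks nodepool add \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name streampool \
      --enable-artifact-streaming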

kishorerv25 commented 8 months ago

Is there any roadmap for when this could be available for Windows workloads? Most of the time, image sizes are much larger for Windows workloads than for Linux.

allyford commented 7 months ago

> Is there any roadmap for when this could be available for Windows workloads? Most of the time, image sizes are much larger for Windows workloads than for Linux.

We're currently working on this feature for Windows workloads as well. Targeting 2024.

jabbera commented 7 months ago

Does this work for AzureLinux or only Ubuntu? https://pixelrobots.co.uk/2023/11/first-look-artifact-streaming-in-preview-for-acr-and-aks/ has a comment that it's Ubuntu-only, but I have yet to see official MSFT documentation.

northtyphoon commented 7 months ago

Both AzureLinux (Mariner) and Ubuntu will be supported.

jabbera commented 7 months ago

First test on this did not go so well. Any advice? Ticket: 2311200040008581

  Normal   Pulled                  5m58s                 kubelet                  Successfully pulled image "<SNIP>.azurecr.io/jupyter/datascience-notebook:preview" in 1.078s (1.078s including waiting)
  Warning  Failed                  5m57s                 kubelet                  Error: failed to create containerd container: failed to attach and mount for snapshot 285: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_285, failed:failed to open remote file https://<SNIP>.azurecr.io/v2/jupyter/datascience-notebook/blobs/sha256:4ab67f5756652a448a4b3254525cda922a5cd99d83a888a5b18eb5d87925ef2e: No such file or directory: unknown
  Warning  Failed                  5m43s                 kubelet                  Error: failed to create containerd container: failed to attach and mount for snapshot 285: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_285, failed:failed to open remote file https://<SNIP>.azurecr.io/v2/jupyter/datascience-notebook/blobs/sha256:46b95c4718c0a0c0f8ff61c54cd4aa80203e915bf338cf4c02cf53e1b37baabd: No such file or directory: unknown
    Pulling image "<SNIP>.azurecr.io/jupyter/datascience-notebook:preview"
dgruber commented 7 months ago

That's fantastic, @allyford! How is AKS node pool support activated from the Go SDK (https://github.com/Azure/azure-sdk-for-go/blob/v68.0.0/services/containerservice/mgmt/2022-07-01/containerservice/models.go)? Or is there a client header, like there was for Teleport ("EnableACRTeleport"), that can be set instead?

northtyphoon commented 7 months ago

@jabbera we managed to repro the issue, which is related to a client-side TCP connection failure to the registry data proxy server. Unfortunately, it is blocking if you have the data proxy enabled on the registry or use a private endpoint. We are working on the fix and will update you once we roll out the change. Just a heads-up, the team is taking the Thanksgiving holiday this week, so responses might be slow. I apologize for the inconvenience.

jabbera commented 7 months ago

@northtyphoon you folks rock! No particular rush. It's very narrowly targeted right now and only in my test environment. That said, my users are really excited to get their node startup time shaved in half. :-)

debajyoti-truefoundry commented 7 months ago

> @jabbera we managed to repro the issue, which is related to a client-side TCP connection failure to the registry data proxy server. Unfortunately, it is blocking if you have the data proxy enabled on the registry or use a private endpoint. We are working on the fix and will update you once we roll out the change. Just a heads-up, the team is taking the Thanksgiving holiday this week, so responses might be slow. I apologize for the inconvenience.

I am facing a similar issue. I used a tag directly and had already generated the image streaming artifact for the same tag.

    spec:
      containers:
        - name: notebook-stream
          image: redacted.azurecr.io/notebooks/jupyter-full:0.2.6

I was unsure whether image streaming was working. I could not see any related events or find useful logs from the overlaybd-snapshotter on the node.

Then, I started using the digest of the streamable artifact directly:

    spec:
      containers:
        - name: notebook-stream
          image: >-
            redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703

It did not work, and I started getting these events:

  Normal   Pulled     7m45s                   kubelet            Successfully pulled image "redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703" in 1.067214459s (1.06722696s including waiting)
  Warning  Failed     7m43s                   kubelet            Error: failed to create containerd container: failed to attach and mount for snapshot 188: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_188, failed:failed to open remote file https://redacted.azurecr.io/v2/notebooks/jupyter-full/blobs/sha256:8822755c906c191c38c37eb8b73648960b24dd77feddb9ac40a903d556a425b8: No such file or directory: unknown
  Normal   Pulled     7m42s                   kubelet            Successfully pulled image "redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703" in 409.655325ms (409.662025ms including waiting)
  Warning  Failed     7m41s                   kubelet            Error: failed to create containerd container: failed to attach and mount for snapshot 188: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_188, failed:failed to open remote file https://redacted.azurecr.io/v2/notebooks/jupyter-full/blobs/sha256:89294667ee3e584d3a06a9e75144b238a0cf99b1849583f0af76a061c9600fb7: No such file or directory: unknown

Let me know if you require any more information. @northtyphoon

northtyphoon commented 7 months ago

@debajyoti-truefoundry it's the same issue. We have the fix ready. However, as the Christmas holiday is approaching, we are currently advised to hold the rollout of the fix. The team is waiting for the next deployment window, likely starting in January. I sincerely apologize for the delay. If you want to try the private build, you can reach me at bindu at microsoft dot com. cc @jabbera

andrey-gava commented 6 months ago

Maybe we got it wrong, but we were expecting that an image pull inside a cluster with streaming enabled would be done by the original digest instead of the streaming digest. In other words, the streaming artifact would be an alias for the original one and would be automatically recognized by the AKS cluster without needing to pass the streaming digest.

In our CI, at the post-image-build step, we populate Helm values with the digest of the built image. For example, its digest is @sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a, while ACR creates a streaming one with digest @sha256:987654323bb6e1596630773bc840c8671df3e806f6bec97213e3f92c38e4ef81. After deploy we see this error:

Failed to pull image "name.azurecr.io/master/ci/platform:master@sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
rpc error: code = FailedPrecondition desc = failed to pull and unpack image
"name.azurecr.io/master/ci/platform@sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
failed commit on ref "manifest-sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
unexpected commit digest sha256:987654323bb6e1596630773bc840c8671df3e806f6bec97213e3f92c38e4ef81,
expected sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a: failed precondition

So the cluster actually sees that a streaming artifact exists but fails to pull it because of the digest mismatch. To be able to use this feature, we would have to change our CI process to additionally query for the streaming artifact digest on top of the original one, which is not very convenient.

P.S. After changing the digest to the streaming one, we see that the pull works and is much faster than in a cluster without streaming enabled. Image size ~7 GB; pull time 11.822819341s versus 2m17.764238179s.

jabbera commented 5 months ago

@northtyphoon Hi! Any update on this issue?

debajyoti-truefoundry commented 5 months ago

@northtyphoon Please let us know if there is any update. Thanks!

northtyphoon commented 5 months ago

@jabbera @debajyoti-truefoundry the fix is currently being rolled out worldwide. May I know which region your cluster is running in?

northtyphoon commented 5 months ago

@andrey-gava what is the image reference you set in the deployment template? Are you using the tag or the digest directly? cc: @juliusl

jabbera commented 5 months ago

@northtyphoon east us2

northtyphoon commented 5 months ago

@jabbera can you please create a new node pool with streaming enabled or update the existing node pool's node image? Let me know if it works.
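
For anyone following along, here is a minimal sketch of both options with the az CLI (resource and pool names are placeholders; `--enable-artifact-streaming` is the flag referenced in this thread and may require the aks-preview extension):

    # Option 1: add a new node pool with artifact streaming enabled.
    az aks nodepool add \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name streampool2 \
      --enable-artifact-streaming

    # Option 2: upgrade the existing node pool to the latest node image only,
    # so it picks up the updated streaming components without a Kubernetes upgrade.
    az aks nodepool upgrade \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name streampool \
      --node-image-only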

andrey-gava commented 5 months ago

> @andrey-gava what is the image reference you set in the deployment template? Are you using the tag or the digest directly? cc: @juliusl

Digest directly.

juliusl commented 5 months ago

Hello. In order for the upgrade to a streamable image to work, you must reference the image by tag. This allows us to check whether the digest resolved by that tag has a streamable artifact available.

We specifically do not try to upgrade when you reference an image by digest, because when an image is referenced by digest we assume no further image resolution should be done. In the future we could consider adding this as an optional override if there is demand for it; however, I feel this could have the side effect of not accurately reflecting which image a container is using.
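
To make the difference concrete, a minimal pod spec sketch (names are placeholders reused from the earlier comments): referencing by tag lets the node resolve and swap in the streamable artifact, while pinning the original build digest leads to the digest-mismatch error shown above.

    spec:
      containers:
        - name: notebook-stream
          # Tag reference: eligible to be upgraded to the streamable artifact.
          image: redacted.azurecr.io/notebooks/jupyter-full:0.2.6
          # By contrast, pinning the original digest, e.g.
          #   image: redacted.azurecr.io/notebooks/jupyter-full@sha256:<original-digest>
          # skips the upgrade, since digest references are not re-resolved.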

jabbera commented 4 months ago

@northtyphoon sorry for the delay, I'm in a worse situation now. My pods won't spin up, and there are no log entries in the pod describe output regarding the image pull. Let me know if I should put in a ticket or if you would like to engage some other way.

jabbera commented 3 months ago

@northtyphoon checking in on the above comment?

northtyphoon commented 3 months ago

@jabbera can you please open a ticket and share the ticket ID with me at bindu at microsoft dot com?

maneeshcdls commented 3 months ago

Hello. When artifact streaming is enabled on a Linux node in our Kubernetes cluster, we're experiencing problems with image pulls. Specifically, we're encountering "Failed to pull image" errors during deployments. Additionally, over time, the disk space on the node fills up, leading to the eviction of all pods.

Observations:

With artifact streaming enabled on the node:

- Failed image pulls during deployments.
- Disk space gradually fills up over time.
- All pods eventually get evicted due to the lack of available disk space.

With artifact streaming disabled on the node:

- Deployments function as expected.
- Images are pulled correctly without errors.
- No significant disk space issues observed.

error: Failed to pull image ".azurecr.io/products/api:master": rpc error: code = Canceled desc = failed to pull and unpack image ".azurecr.io/products/api:master": failed to resolve reference "**.azurecr.io/products/api:master": failed to do request: Head "https://localhost:8578/v2/products/api/manifests/master?ns=.azurecr.io": context cancel

Does anyone have any idea? If we have many images in the repo, could that be causing this issue?

juliusl commented 3 months ago

@maneeshcdls thanks for reporting your issue.

A fix that addresses these issues is being released. See https://github.com/Azure/acr/issues/739

cveld commented 2 months ago

I am looking for a bit more technical insight into how this streaming actually works. Why is a streaming copy of the original image required? Is it only the image pull process that is optimized (I suspect streaming mainly means all data comes through a multiplex of sockets, without the latency of reconnecting per layer request), or is there also some lazy loading kicking in when the pod is started?

ganeshkumarashok commented 2 months ago

Thanks for your interest in how artifact streaming works behind the scenes @cveld.

Streaming involves on-demand image loading: you download the data that is needed, when it is needed. On the node, the custom containerd snapshotter and storage driver need to know which chunks of data to download and decompress during pod workload runtime. On-demand loading happens at a sub-layer level for performance, since layers can be quite large and often most of that data is never used. We need a mapping between specific files in the image and the offset blocks/chunks corresponding to them. At a high level, the streaming version of the same image contains this mapping, stored in a special format that makes sub-layer decompression performant. That mapping can be created with the help of a special image format (like the default overlaybd, which we use on AKS and ACR) or generated separately. That is why we create a streaming version of the image and attach it to the original image manifest when streaming is enabled.

For your other question: artifact streaming is primarily based on lazy-image loading now.

We (my co-presenter from overlaybd and I) also shared more about this at KubeCon EU last month (KubeCon link / YouTube), along with additional approaches to addressing the pod start problem. Happy to answer other questions as well.
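
For readers who want to poke at this on a node, a rough sketch of how to confirm the streaming snapshotter is in play (the busybox image and the exact plugin naming on AKS nodes are assumptions; adjust to what is actually present):

    # Open a shell on the node and chroot into the host filesystem.
    kubectl debug node/<node-name> -it --image=busybox
    chroot /host

    # Check whether containerd has an overlaybd proxy snapshotter configured.
    grep -A 3 overlaybd /etc/containerd/config.toml

    # List containerd plugins and look for the snapshot proxy plugin.
    ctr plugins ls | grep snapshot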

tolga-hmcts commented 3 weeks ago

Has the fix (https://github.com/Azure/acr/issues/739) been rolled out to UK South? And how can I roll back "az aks nodepool update --enable-artifact-streaming"?