allyford opened this issue 9 months ago
@allyford is this available for preview yet? I just saw this now and I've been trying to get invited to the teleport preview for ages
Preview coming soon! The CLI experience will be available next week. I'll continue to update this GitHub issue.
Is there any roadmap for when this could be available for Windows workloads? Most of the time, image sizes are larger for Windows workloads than for Linux.
We're currently working on this feature for Windows workloads as well, targeting 2024.
Does this work for AzureLinux or only Ubuntu? https://pixelrobots.co.uk/2023/11/first-look-artifact-streaming-in-preview-for-acr-and-aks/ has a comment that it's ubuntu only but I have yet to see official MSFT documentation.
Both AzureLinux (Mariner) and Ubuntu will be supported.
First test on this did not go so well. Any advice? Ticket: 2311200040008581
```
Pulling image "<SNIP>.azurecr.io/jupyter/datascience-notebook:preview"
Normal   Pulled  5m58s  kubelet  Successfully pulled image "<SNIP>.azurecr.io/jupyter/datascience-notebook:preview" in 1.078s (1.078s including waiting)
Warning  Failed  5m57s  kubelet  Error: failed to create containerd container: failed to attach and mount for snapshot 285: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_285, failed:failed to open remote file https://<SNIP>.azurecr.io/v2/jupyter/datascience-notebook/blobs/sha256:4ab67f5756652a448a4b3254525cda922a5cd99d83a888a5b18eb5d87925ef2e: No such file or directory: unknown
Warning  Failed  5m43s  kubelet  Error: failed to create containerd container: failed to attach and mount for snapshot 285: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_285, failed:failed to open remote file https://<SNIP>.azurecr.io/v2/jupyter/datascience-notebook/blobs/sha256:46b95c4718c0a0c0f8ff61c54cd4aa80203e915bf338cf4c02cf53e1b37baabd: No such file or directory: unknown
```
That's fantastic, @allyford! How is AKS node pool support activated from the Go SDK (https://github.com/Azure/azure-sdk-for-go/blob/v68.0.0/services/containerservice/mgmt/2022-07-01/containerservice/models.go)? Or is there a client header that can be set, like the "EnableACRTeleport" header was for teleport?
@jabbera we managed to repro the issue, which is related to a client-side TCP connection failure to the registry data proxy server. Unfortunately, it is blocking if you have data proxy enabled on the registry or use a private endpoint. We are working on the fix and will update you once we roll out the change. Just a heads-up: the team is taking the Thanksgiving holiday this week and responses might be slow. I apologize for the inconvenience.
@northtyphoon you folks rock! No particular rush. It's very narrowly targeted right now and only in my test environment. That said, my users are really excited to get their node startup time shaved in half :-)
I am facing a similar issue. I used a tag directly and had already generated the image streaming artifact for the same tag.
```yaml
spec:
  containers:
    - name: notebook-stream
      image: redacted.azurecr.io/notebooks/jupyter-full:0.2.6
```
I was unsure whether image streaming was working: I could not see any related events or find useful logs from the overlaybd-snapshotter on the node.
Then, I started using the digest of the streamable artifact directly:
```yaml
spec:
  containers:
    - name: notebook-stream
      image: redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703
```
It did not work, and I started getting these events:
```
Normal   Pulled  7m45s  kubelet  Successfully pulled image "redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703" in 1.067214459s (1.06722696s including waiting)
Warning  Failed  7m43s  kubelet  Error: failed to create containerd container: failed to attach and mount for snapshot 188: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_188, failed:failed to open remote file https://redacted.azurecr.io/v2/notebooks/jupyter-full/blobs/sha256:8822755c906c191c38c37eb8b73648960b24dd77feddb9ac40a903d556a425b8: No such file or directory: unknown
Normal   Pulled  7m42s  kubelet  Successfully pulled image "redacted.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703" in 409.655325ms (409.662025ms including waiting)
Warning  Failed  7m41s  kubelet  Error: failed to create containerd container: failed to attach and mount for snapshot 188: failed to enable target for /sys/kernel/config/target/core/user_999999999/dev_188, failed:failed to open remote file https://redacted.azurecr.io/v2/notebooks/jupyter-full/blobs/sha256:89294667ee3e584d3a06a9e75144b238a0cf99b1849583f0af76a061c9600fb7: No such file or directory: unknown
```
Let me know if you require any more information. @northtyphoon
@debajyoti-truefoundry it's the same issue. We have the fix ready. However, as the Christmas holiday is approaching, we are currently being advised to hold the rollout of the fix. The team is waiting for the next deployment window, likely starting in January. I sincerely apologize for the delay. If you want to try the private build, you can reach me at bindu at microsoft dot com. cc @jabbera
Maybe we got it wrong, but we were expecting that image pulls inside a cluster with streaming enabled would be done by the original digest instead of the streaming digest. In other words, the streaming artifact would act as an alias for the original one and would be recognized automatically by the AKS cluster without the need to pass the streaming digest.
In our CI, at the post-image-build step, we populate Helm values with the digest of the built image. For example, its digest is @sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a. ACR creates a streaming one with digest @sha256:987654323bb6e1596630773bc840c8671df3e806f6bec97213e3f92c38e4ef81. After deploying we see this error:
```
Failed to pull image "name.azurecr.io/master/ci/platform:master@sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
rpc error: code = FailedPrecondition desc = failed to pull and unpack image
"name.azurecr.io/master/ci/platform@sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
failed commit on ref "manifest-sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a":
unexpected commit digest sha256:987654323bb6e1596630773bc840c8671df3e806f6bec97213e3f92c38e4ef81,
expected sha256:12345674832657291fce8f4ed84c580932941392897eafc64317ee9589fbcf8a: failed precondition
```
So the cluster actually sees that a streaming artifact exists but fails to pull it because of the digest mismatch. To be able to use this feature, we must extend our CI process with a step that queries for the streaming artifact digest in addition to the original one, which is not very convenient.
P.S. After changing the digest to the streaming one, we see that the pull works and is much faster than in a cluster without streaming enabled: image size ~7 GB, pull time 11.822819341s versus 2m17.764238179s.
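For the CI workaround described above, one option is to list the referrers attached to the built image's manifest (e.g. via `az acr manifest list-referrers` or the OCI referrers API) and substitute the streaming artifact's digest into the Helm values. A minimal Python sketch; the response shape and the `artifactType` string are assumptions for illustration, not the documented values:

```python
import json

# Hypothetical artifactType marking streaming (overlaybd) artifacts;
# inspect your registry's actual referrers output before relying on it.
STREAMING_ARTIFACT_TYPE = "application/vnd.azure.artifact.streaming.v1"

def streaming_digest(referrers_json: str,
                     artifact_type: str = STREAMING_ARTIFACT_TYPE):
    """Return the digest of the streaming artifact referring to an image,
    or None if no matching referrer is found."""
    doc = json.loads(referrers_json)
    for ref in doc.get("referrers", []):
        if ref.get("artifactType") == artifact_type:
            return ref["digest"]
    return None

# Example payload using the placeholder digests from the comment above.
sample = json.dumps({
    "referrers": [
        {
            "artifactType": STREAMING_ARTIFACT_TYPE,
            "digest": "sha256:987654323bb6e1596630773bc840c8671df3e806f6bec97213e3f92c38e4ef81",
        }
    ]
})

print(streaming_digest(sample))
```

A CI step could run this lookup right after the build and fall back to the original digest when no streaming referrer exists yet.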
@northtyphoon Hi! Any update on this issue?
@northtyphoon Please let us know if there is any update. Thanks!
@jabbera @debajyoti-truefoundry the fix is currently being rolled out worldwide. May I know which region your cluster is running in?
@andrey-gava what is the image reference you set in the deployment template? Are you using the tag or the digest directly? cc: @juliusl
@northtyphoon east us2
@jabbera can you please create a new node pool with streaming enabled or update the existing node pool image? Let me know if it works.
@andrey-gava what is the image reference you set in the deployment template? Are you using the tag or the digest directly? cc: @juliusl
Digest directly.
Hello. In order for the upgrade to a streamable image to work, you must reference the image by tag. This allows us to check if the digest resolved by that tag has a streamable artifact available.
We specifically do not try to upgrade when you reference an image by digest, because when an image is referenced by digest we assume no further image resolution should be done. In the future we could consider adding this as an optional override if there is demand for it; however, I feel this could have the side effect of not accurately reflecting which image a container is using.
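To illustrate the distinction in a pod spec (image names here are placeholders, sketching the behavior described above):

```yaml
spec:
  containers:
    # Streaming upgrade applies: the tag is resolved at pull time, and if the
    # resolved digest has a streamable artifact attached, the node can use it.
    - name: by-tag
      image: myregistry.azurecr.io/notebooks/jupyter-full:0.2.6
    # No streaming upgrade: a digest reference is treated as final, so the node
    # pulls exactly this manifest and performs no further resolution.
    - name: by-digest
      image: myregistry.azurecr.io/notebooks/jupyter-full@sha256:7298bf3d18093e4a71500e2a91e3a408eb83b1920c87e81dee6bc2f98fe39703
```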
@northtyphoon sorry for the delay, I'm in a worse situation now. My pods won't spin up, and there are no log entries in the pod describe output with respect to the image pull. Let me know if I should put in a ticket or if you would like to engage some other way.
@northtyphoon checking in on the above comment?
@jabbera can you please open a ticket and share the ticket id with me bindu at microsoft dot com?
Hello. When Artifact Streaming is enabled on a Linux node in our Kubernetes cluster, we're experiencing problems with image pulls. Specifically, we're encountering "Failed to pull image" errors during deployments. Additionally, over time the disk space on the node fills up, leading to the eviction of all pods.
Observations:

With Artifact Streaming enabled on the node:
- Failed image pulls during deployments.
- Disk space gradually fills up over time.
- All pods eventually get evicted due to the lack of available disk space.

With Artifact Streaming disabled on the node:
- Deployments function as expected.
- Images are pulled correctly without errors.
- No significant disk space issues observed.
error:
```
Failed to pull image ".azurecr.io/products/api:master": rpc error: code = Canceled desc = failed to pull and unpack image ".azurecr.io/products/api:master": failed to resolve reference "**.azurecr.io/products/api:master": failed to do request: Head "https://localhost:8578/v2/products/api/manifests/master?ns=.azurecr.io": context cancel
```
Does anyone have any idea? If we have many images in the repo, do we get this issue?
@maneeshcdls thanks for reporting your issue.
A fix that addresses these issues is being released. See https://github.com/Azure/acr/issues/739
I am looking for a bit more technical insight into how this streaming actually works. Why is a streaming copy of the original image required? Is it only the image pull process that is optimized (I suspect streaming mainly means all data comes through a multiplex of sockets, without the latency of reconnecting per layer request), or is there also some lazy loading kicking in when the pod is started?
Thanks for your interest in how artifact streaming works behind the scenes @cveld.
Streaming involves on-demand image loading: you download the data that is needed, when it is needed. On the nodes, the custom containerd snapshotter and storage driver need to know which chunks of data to download and decompress during pod workload runtime. And on-demand loading happens at a sub-layer level for performance: layers can be quite large, and often most of that data is not used.

We need a mapping between specific files in the image and the offset blocks/chunks corresponding to them. At a high level, a streaming version of the same image has this mapping, and it is in a special format that makes sub-layer decompression performant. That mapping can be created with the help of a special format (like the default overlaybd, which we use on AKS and ACR) or as a separately generated mapping. That is why, when streaming is enabled, we create a streaming version of the image and attach it to the original image manifest.
For your other question: artifact streaming is primarily based on lazy image loading right now.
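The on-demand loading described above can be illustrated with a toy sketch: instead of downloading a whole blob up front, the reader fetches only the byte ranges (chunks) that a read actually touches, roughly the way a streaming block driver services block reads. Everything here (the chunk size, the in-memory "remote" blob) is invented for illustration and is not the overlaybd implementation:

```python
# Toy model of on-demand (lazy) loading: a remote blob is fetched
# chunk-by-chunk as reads arrive, instead of being downloaded up front.
CHUNK = 4  # bytes per chunk; real drivers use much larger blocks

class LazyBlob:
    def __init__(self, remote: bytes):
        self._remote = remote      # stands in for the registry blob
        self._cache = {}           # chunk index -> bytes already fetched
        self.fetched_chunks = 0    # how many ranged "downloads" happened

    def _fetch(self, idx: int) -> bytes:
        # Simulates an HTTP Range request for one chunk of the blob.
        if idx not in self._cache:
            self._cache[idx] = self._remote[idx * CHUNK:(idx + 1) * CHUNK]
            self.fetched_chunks += 1
        return self._cache[idx]

    def read(self, offset: int, length: int) -> bytes:
        # Serve a read by fetching only the chunks it overlaps.
        out = b""
        for idx in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            out += self._fetch(idx)
        start = offset % CHUNK
        return out[start:start + length]

blob = LazyBlob(b"0123456789abcdef")  # 16-byte "layer", 4 chunks
data = blob.read(5, 4)                # touches chunks 1 and 2 only
print(data, blob.fetched_chunks)      # only 2 of the 4 chunks were fetched
```

The streaming mapping's job is essentially to make this addressing possible on compressed layer data, so a file read can be translated into a small set of remote chunk fetches.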
My co-presenter from overlaybd and I also shared more about this at KubeCon EU last month (KubeCon link / YouTube), along with additional approaches to address the pod-start problem. Happy to answer other questions as well.
Has the fix (https://github.com/Azure/acr/issues/739) been rolled out to UK South? And how can I roll back "az aks nodepool update --enable-artifact-streaming"?
Previously known as Teleport, Artifact Streaming will enable customers to store container images in a single registry and manage and stream them to Azure Kubernetes Service (AKS) clusters in multiple regions. Artifact Streaming will let customers deploy container applications to multiple regions without having to create multiple registries or enable geo-replication.
Azure Container Registry (ACR) and Azure Kubernetes Service (AKS) will soon support artifact streaming. Artifact streaming for AKS provides customers the ability to accelerate containerized workloads in the cloud by dramatically reducing the overall startup time by almost 15% when connected to ACR.
Artifact streaming will empower customers to scale resources on AKS seamlessly by removing the long pull times for each Kubernetes pod. Customers with Linux amd64 container images will be supported, and we plan to support Windows and arm64 container images in the future.
The ACR and AKS team would also like to give a huge thanks to Alibaba for their contributions to the containerd Overlaybd project.
We can’t wait to hear what our customers think, and we look forward to feedback on further improving this feature.