[EKS] [Feature Request ]: Support for Image Streaming in EKS to Accelerate Large Container Image Pulls

suyog1pathak commented 1 month ago

Feature Request: Support for Image Streaming in EKS to Accelerate Large Container Image Pulls

Background:

In environments where large container images are used frequently, the time it takes to pull these images can significantly impact application startup times and cluster performance. In GKE (Google Kubernetes Engine), Image Streaming has been introduced to address this issue. With Image Streaming, container images are pulled on-demand as needed, rather than being fully downloaded before the container starts. This dramatically reduces startup times for large images, especially when applications don’t need the entire image at launch.

Here is the reference to GKE’s Image Streaming feature: Image Streaming in GKE

Request:

I would like to propose adding a similar feature to Amazon EKS that allows for streaming container images directly from registries. This feature would benefit users who work with large container images and need to improve their application startup times.

Key Benefits:

Faster Pod Startup: Containers can begin running sooner, reducing the impact of image pull times on deployment.
Optimized Network Usage: Only the required layers or parts of the image are pulled initially, leading to more efficient bandwidth usage.
Improved Cluster Efficiency: Large images would not block nodes while downloading, allowing workloads to be scheduled faster and cluster resources to be utilized more effectively.
**faster scaling

Current Workarounds:

Currently, users like myself have to implement workaround solutions, such as:

Pre-pulling images using DaemonSets.
Using SSH or cloud-init scripts to pull images on node boot.
Caching images via other methods.

While these solutions work, they add complexity and introduce unnecessary overhead into the cluster setup. A native EKS feature similar to GKE Image Streaming would provide a clean, scalable, and efficient way to handle large container images in the cluster.

jlbutler commented 1 month ago

Thanks for opening this issue. The Streaming OCI project is aligned functionally with this request, and we have considered how we might make that work for EKS customers.

To help us with that, we would be interested to hear about how customers think about the tradeoffs between starting containers faster vs long-term stability and performance.

Related to performance, how does starting a container within seconds contrast with potential impact to local IO performance in container instances? Put another way, what would be a rough threshold for acceptable performance impact related to streaming when getting super fast start times?

On stability, what would be the best way for EKS to help customers handle new types of errors not seen before in container runtime? For example if the repository becomes unreachable while a workload is running and streaming fails, what sorts of errors or retries would you expect?

Thanks in advance for any feedback.

suyog1pathak commented 1 month ago

Performance: Fast Container Start Times vs. Local IO Impact

Container Start Time Expectation: Ideally, containers should start in under 1-2 seconds for small and medium-sized images when streaming is enabled. This is especially useful in scenarios where services scale based on demand.
IO Performance Threshold: A slight decrease in IO performance (around 5-10%) would likely be acceptable, especially if it enables faster scale-up of services in response to traffic spikes. However, a more significant IO impact could affect workloads that rely on local storage, databases, or intensive file I/O operations.

Stability: Handling Streaming Errors

Expected Behavior if Streaming Fails: If the OCI repository becomes unreachable during runtime, the most important thing is to ensure that already-running containers are unaffected. Perhaps, in this scenario, EKS could attempt to complete the current operation (e.g., pulling layers already initiated), or retry fetching from the repository. If the image stream is interrupted, the container should fall back to cached layers and remain operational.
Graceful Retries: There should be configurable retry policies, ideally backoff retries for repository connectivity issues. If a retry mechanism isn't successful, the workload should trigger an alert but continue to run based on the layers already cached locally.
Notifications or Alerts: Integrating failure notifications into the EKS control plane would be useful. Users would appreciate knowing when streaming errors occur, but the failure should not result in the crashing or restarting of the running containers unless absolutely necessary.

I would also like to suggest incorporating a local caching solution that leverages AWS services like EFS for faster image retrieval and scaling. This would be particularly useful for large images, such as those used in machine learning (ML) models.

Use Case Example: For instance, if an EKS node pulls an image that is 15GB in size (e.g., a machine learning model image), it would only need to pull the image once. After that, the image could be cached in AWS EFS, and subsequent nodes could pull the image from the local EFS cache. This would significantly reduce startup times, as the data transfer within the VPC would be much faster compared to external image pulls.
Flexibility: This solution would be especially helpful for users who do not wish to use ECR with a VPC endpoint. It provides an alternative that relies on local caching for faster scaling while minimizing external network dependencies.

jlbutler commented 4 weeks ago

@suyog1pathak thanks so much for the detailed response.

I'd like to leave this issue as a general recommendation (Streaming, vs a specific implementation). However, we do have an existing issue specifically related to Streaming OCI (SOCI) that I failed to mention in my response here.

Thanks again!

aws / containers-roadmap