awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
658 stars 223 forks

Using the Mountpoint for Amazon S3 CSI driver and S3 Express One Zone to expedite ML workloads #601

Closed meetreks closed 2 months ago

meetreks commented 3 months ago

What is the outcome that you are trying to reach?

We want to show customers how to use the new Mountpoint for Amazon S3 CSI driver to speed up Spark ML workflows.

Describe the solution you would like

Our solution will use Mountpoint for Amazon S3 with Amazon S3 Express One Zone. S3 Express One Zone is a high-performance, single-zone Amazon S3 storage class designed to deliver consistent, single-digit millisecond data access for your most latency-sensitive applications. It is the lowest-latency cloud object storage class available today, with data access speeds up to 10x faster and request costs 50% lower than S3 Standard. Applications can benefit immediately from requests completing up to an order of magnitude faster. S3 Express One Zone provides performance elasticity similar to that of other S3 storage classes.

Describe alternatives you have considered

NA

Additional context

If you train computer vision Machine Learning (ML) models, you can speed up training so that jobs finish sooner, which saves compute time and cost. This is because the data resides in a single Availability Zone (AZ), and Mountpoint for Amazon S3 further speeds up data access. The setup also suits an EMR (Elastic MapReduce) big data scenario: an EMR cluster is AZ-specific, so pairing it with S3 Express One Zone in the same AZ and with Mountpoint for Amazon S3 lets you prepare the training data and run the training job on the same dataset.

A common pattern in customer compute workloads is repeated reads from the original data store. This is not cost-efficient, because the same data is fetched from the underlying store again and again over the life cycle of the compute. Mountpoint for Amazon S3 recently introduced caching, which optimizes price/performance for repeated data access. Cached reads are 2x faster than normal reads. With Mountpoint for Amazon S3, recently or repeatedly used data can be cached in EC2 instance storage (the storage attached to the EC2 instance) or on EBS volumes. The first read occurs at S3 latency; subsequent reads are served at the lower latency of instance storage or EBS. When you mount an S3 bucket, you can optionally enable caching through flags. You can configure the location and size of the data cache and how long metadata is retained in the cache. When you mount a bucket with caching enabled, Mountpoint creates an empty sub-directory at the configured cache location if that sub-directory doesn't already exist. When you unmount, Mountpoint deletes the contents of the cache location.
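To make the caching flags concrete, here is a minimal Python sketch that assembles a `mount-s3` command line with caching enabled. The flag names (`--cache`, `--metadata-ttl`, `--max-cache-size`) are the ones I recall from the Mountpoint for Amazon S3 documentation, so please verify them against `mount-s3 --help` for your version. The helper name `build_mount_command`, the bucket name, and the paths are hypothetical and used only for illustration.

```python
import shlex
from typing import List, Optional


def build_mount_command(
    bucket: str,
    mount_dir: str,
    cache_dir: Optional[str] = None,
    metadata_ttl_seconds: Optional[int] = None,
    max_cache_size_mib: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for mount-s3 (hypothetical helper, not part of Mountpoint)."""
    # A cache size limit is meaningless without a cache location,
    # so this sketch rejects that combination up front.
    if max_cache_size_mib is not None and cache_dir is None:
        raise ValueError("max_cache_size_mib requires cache_dir")

    cmd = ["mount-s3", bucket, mount_dir]
    if cache_dir is not None:
        # Location of the local data cache (instance storage or an EBS volume).
        cmd += ["--cache", cache_dir]
    if metadata_ttl_seconds is not None:
        # How long metadata stays in the cache, in seconds.
        cmd += ["--metadata-ttl", str(metadata_ttl_seconds)]
    if max_cache_size_mib is not None:
        # Upper bound on the data cache size, in MiB.
        cmd += ["--max-cache-size", str(max_cache_size_mib)]
    return cmd


# Hypothetical bucket and paths, for illustration only.
command = build_mount_command(
    bucket="my-training-data",
    mount_dir="/mnt/s3",
    cache_dir="/mnt/nvme/mountpoint-cache",
    metadata_ttl_seconds=300,
    max_cache_size_mib=10240,
)
print(shlex.join(command))
```

The command is built as a list rather than a string so it can be handed to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.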

Any Kubernetes application that needs high-throughput read/write access to S3 can use the newly introduced Mountpoint for Amazon S3 CSI driver. Because the CSI driver is built on Mountpoint for Amazon S3, Kubernetes apps get high throughput without any changes to the app code. The driver supports self-managed Kubernetes clusters, and for Amazon Elastic Kubernetes Service (EKS) it is available as a managed add-on.
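As a rough illustration of how an app could consume a bucket through the CSI driver without code changes, the Python sketch below generates a statically provisioned PersistentVolume manifest as JSON (which `kubectl apply -f` accepts). The driver name `s3.csi.aws.com` and the `bucketName` volume attribute are as I recall them from the driver's published examples, so treat them as assumptions to verify. The helper `s3_persistent_volume`, the volume name, the bucket name, and the capacity value are hypothetical. A matching PersistentVolumeClaim and pod spec would still be needed and are not shown.

```python
import json
from typing import Any, Dict, Sequence


def s3_persistent_volume(
    name: str, bucket: str, mount_options: Sequence[str] = ()
) -> Dict[str, Any]:
    """Build a PersistentVolume manifest for an S3 bucket (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            # Kubernetes requires a capacity value; this number is a placeholder.
            "capacity": {"storage": "1200Gi"},
            "accessModes": ["ReadWriteMany"],
            # Options passed through to Mountpoint, e.g. "allow-delete".
            "mountOptions": list(mount_options),
            "csi": {
                # Assumed driver name and attribute key; check the driver docs.
                "driver": "s3.csi.aws.com",
                "volumeHandle": f"{name}-handle",
                "volumeAttributes": {"bucketName": bucket},
            },
        },
    }


# Hypothetical names, for illustration only.
manifest = s3_persistent_volume(
    name="s3-training-pv",
    bucket="my-training-data",
    mount_options=["allow-delete"],
)
print(json.dumps(manifest, indent=2))
```

The pod then mounts the claim like any other volume and reads the bucket contents as files, which is why the application code itself does not need to change.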

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.

github-actions[bot] commented 2 months ago

Issue closed due to inactivity.