CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

Shared file system between ingest and storage - what is the correct AWS approach? #1140

Closed: terrywbrady closed this issue 2 years ago

terrywbrady commented 2 years ago

The amount of content deposited into Merritt varies significantly from week to week. Deposits from each campus are sporadic and project-driven. One week in May 2022, we had 19TB of content deposited.

When a campus does initiate a large ingest, our current provisioned I/O is overwhelmed by the content.

What accesses this shared content?

Past implementations

Options

What would AWS recommend?

Martin will add this to the agenda for an IAS conversation with Kevin


Workflow

File System Needs

Options

terrywbrady commented 2 years ago

https://pilotcoresystems.com/insights/ebs-efs-fsx-s3-how-these-storage-options-differ/

From the article: "AWS EFS has you covered for all file-system storage requirements. Or does it? EFS works with EC2 instances as a managed NAS filer. FSx, on the other hand, offers a managed Windows Server environment that runs Windows Server Message Block services."



9:45 This page sounds closer to what we were discussing, so my prior link may not be so helpful: https://aws.amazon.com/fsx/lustre/faqs/

Ashley Gould 9:51 AM from the same FAQ page: "Amazon FSx also integrates with Amazon S3, making it easy for you to process cloud data sets with the Lustre high-performance file system. When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and automatically updates the contents of the linked S3 bucket as files are added to, changed in, or deleted from the file system."

sfisher 10:20 AM FSx sounds neat for working with data processing and representing S3 files as local ones would be nice for operating on them. I'm guessing it's fairly pricey?

Colin Thompson 10:48 AM roughly double the price of EFS, if I'm reading this correctly: https://aws.amazon.com/fsx/lustre/pricing/ https://aws.amazon.com/efs/pricing/
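
For context on what "linked to an S3 bucket" means operationally, a rough CLI sketch (not from the thread; bucket, subnet, and capacity are placeholders, and parameter names should be checked against the current FSx docs):

$ aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-0123456789abcdef0 \
    --lustre-configuration DeploymentType=SCRATCH_2,ImportPath=s3://example-merritt-bucket,ExportPath=s3://example-merritt-bucket
# the file system then presents the bucket's objects as files under its mount point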

terrywbrady commented 2 years ago

7/27 - Meeting with IAS and AWS Reps

Refactoring consideration

terrywbrady commented 2 years ago

What about allocating specific file systems for priority collections?

ashleygould commented 2 years ago

Estimates for ZFS (Amazon FSx for OpenZFS) changes

SSD storage capacity:   $0.090 per GB-month
Throughput capacity:    $0.260 per MBps-month
SSD IOPS:               $0.0060 per IOPS-month

Assume you want to store 5 TB of general-purpose file data using SSD storage in the US West Region. Provision a 5 TB Single-AZ file system with 256 MB/s of throughput capacity.

Storage:        5 TB x $0.09 per GB-month       = $461/mo
IOPs:           15360 (3 per GB storage)        =   $0/mo
Throughput:     256 MB/s x $0.26 per MB/s-month =  $67/mo

Total monthly charge:                             $528/mo

10TB SSD storage

Storage:        10 TB x $0.09 per GB-month      = $922/mo
IOPs:           30720 (3 per GB storage)        =   $0/mo
Throughput:     256 MB/s x $0.26 per MB/s-month =  $67/mo

Total monthly charge:                             $989/mo
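
Not from the thread, just a quick sanity check of the arithmetic above (assumes bc is available; the results round to the $922 and $67 shown):

$ echo "10 * 1024 * 0.090" | bc   # 10 TB of SSD storage, USD per month
921.600
$ echo "256 * 0.26" | bc          # 256 MB/s of throughput, USD per month
66.56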
ashleygould commented 2 years ago

Testing zfs on stage

I am working with IAS to create a 2 TB ZFS volume, mounted on each of the uc3-mrt-ingest-stg and uc3-mrt-store-stg hosts at /apps/ingest-stg-zfs. Once ready, Mark will reconfigure uc3-mrt-ingest-stg to write data to this new mount point and restart ingest. The Merritt team will then load some large ingests into stage to see how the new volume performs.

Martin and I are done adding the ZFS volume to all 4 stage ingest and store hosts. Please configure ingest to write to the new mount point /apps/ingest-stg-zfs.

agould@uc3-ingest01x2-stg:~> df -h /apps/ingest-stg-zfs
Filesystem                                             Size  Used Avail Use% Mounted on
fs-0b9822a1af853be7a.fsx.us-west-2.amazonaws.com:/fsx  2.0T     0  2.0T   0% /apps/ingest-stg-zfs
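
For the record, FSx for OpenZFS is consumed as an ordinary NFS export, so the mount on each host looks roughly like this (mount options are an assumption, not copied from the actual fstab):

$ sudo mount -t nfs -o nfsvers=4.1 \
    fs-0b9822a1af853be7a.fsx.us-west-2.amazonaws.com:/fsx /apps/ingest-stg-zfs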

Removing Stage ingest worker 2 from the new ALB to configure the new ZFS disk.

Changing the mount point is a bit more involved than a single symlink, but it should be no problem. The places that reference the queue path:

SSM
        /uc3/mrt/stg/ingest/config/ingestQueuePath      /apps/ingest-stg-shared/ingest_home/queue
Tomcat
        webapps/ingestqueue -> /apps/ingest-stg-shared/ingest_home/queue
ingest_home
        /dpr2/ingest_home/queue -> /apps/ingest-stg-shared/ingest_home/queue
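
A sketch of what repointing those three references could look like (the queue path under /apps/ingest-stg-zfs is an assumed layout, not the team's actual runbook):

# overwrite the SSM parameter with the new queue path
$ aws ssm put-parameter --name /uc3/mrt/stg/ingest/config/ingestQueuePath \
    --value /apps/ingest-stg-zfs/ingest_home/queue --type String --overwrite
# repoint the Tomcat and ingest_home symlinks
$ ln -sfn /apps/ingest-stg-zfs/ingest_home/queue webapps/ingestqueue
$ ln -sfn /apps/ingest-stg-zfs/ingest_home/queue /dpr2/ingest_home/queue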
mreyescdl commented 2 years ago

Stage Ingest and Storage are now using shared ZFS disk.

elopatin-uc3 commented 2 years ago

@terrywbrady we talked about the status of ZFS testing on stage in today's (8/15) team meeting. Ashley brought up an aspect of ZFS that we were not yet aware of, called thin provisioning. This type of provisioning lets us set a top end for the ZFS allocation while still allowing one-way growth beyond it in case a large submission breaches the initially specified allocation limit.

@mreyescdl is going to let the current test ingest complete on stage (Nuxeo UCM San Joaquin collection content). Then, using a new volume with a small 100GB ZFS allocation and thin provisioning active, we'll start another test to observe this type of provisioning in action and collect test data. @ashleygould is also going to discuss with IAS temporarily setting the stage ingest hosts to m5.large for use during this test, as we've observed significant CPU I/O wait on the smaller existing hosts and are unsure whether this is due to lack of network bandwidth or to disk access.
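
For illustration only: in plain ZFS terms, thin provisioning amounts to a quota with no up-front reservation. FSx exposes the equivalent through its volume API rather than a shell, so this is just the concept, not what would actually run on stage:

# space is consumed only as data lands; the quota caps growth at 100G
$ zfs create tank/ingest-test
$ zfs set quota=100G tank/ingest-test
$ zfs set reservation=none tank/ingest-test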

ashleygould commented 2 years ago

Mark tested with a large submission and results were better than EFS, but ingest instances had high CPU IOWait. Stage instances are t3.small for ingest and m5.large for store.
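
One way to separate disk wait from network saturation during the next run (iostat and sar come from the sysstat package, which is assumed to be installed on the stage hosts):

$ iostat -xz 5   # %iowait plus per-device utilization
$ sar -n DEV 5   # per-interface throughput, to spot a saturated NIC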

next steps:

Links:

ashleygould commented 2 years ago

Update:

Thin provisioning is a bust. It does not do what we were hoping for, so we did not create a new ZFS filesystem. We will continue to test on the existing one.

Librato does not support the FSx service, so no dashboards there.

But Martin Haye set up a CloudWatch dashboard for us: https://cloudwatch.amazonaws.com/dashboard.html?dashboard=UC3_ZFS_fs-0b9822a1af853be7a_Da[…]NDMxLTRiYWEtOWYwMC0xZTZlMjU3NjkwODciLCJNIjoiUHVibGljIn0= You have to click "options" and select the sample period; otherwise it selects one for you but does not tell you what it is.
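
The CLI equivalent makes the sample period explicit. A hedged sketch pulling write throughput for this file system (metric and dimension names as I understand the AWS/FSx namespace; the time range is a placeholder):

$ aws cloudwatch get-metric-statistics --namespace AWS/FSx \
    --metric-name DataWriteBytes --statistics Sum --period 300 \
    --dimensions Name=FileSystemId,Value=fs-0b9822a1af853be7a \
    --start-time 2022-08-22T00:00:00Z --end-time 2022-08-23T00:00:00Z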

Martin changed the EC2 instance type on one of the uc3-mrt-ingest-stg hosts to c5n.large. This should give us much better network bandwidth, which I think may have been the bottleneck in the first trial.

mreyescdl commented 2 years ago

@ashleygould ZFS testing started on Stage; expect TBs of data to be submitted to the Ingest workers over the next few days. Workers now: 01 - c5n.large, 02 - t2.small.

mreyescdl commented 2 years ago

Ingest is having problems removing the payload directory:

[error] HandlerCleanup: Failure in removing: /dpr2/ingest_home/queue/bid-bd2f4790-bc24-412f-94d6-d7c37ce341e0/jid-1aa018ba-86d0-4fb3-9b91-64025ecd4993/producer   Continuing

This is due to hidden .nfs* files being created (the NFS client renames a file that is deleted while still open, which blocks removal of its parent directory):

$ ls -la producer/
total 14
-rw-r--r-- 1 dpr2 dpr2 1058816 Aug 23 14:53 .nfs00000000000000100000005b
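
The .nfs* name is the NFS client hiding a file that was deleted while a process still held it open; the parent directory cannot be removed until that handle closes. A quick way to find the holder (the path is a placeholder, and lsof may need to be installed):

$ lsof +D /dpr2/ingest_home/queue/<bid>/<jid>/producer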
mreyescdl commented 2 years ago

The problem was in the Manifest processor in the core library. A valid manifest payload did not trigger the error, but a regular file payload would. Here is the fix:
https://github.com/CDLUC3/mrt-core2/pull/16

elopatin-uc3 commented 2 years ago

Thanks for finding the root cause for this @mreyescdl

mreyescdl commented 2 years ago

Legacy EFS Stage disk contents (fs-6fe432c4.efs)

$ du -sh *
3.2G    dataone
211M    frontera
219G    ingest_home
31G     palestinian_museum
716K    terry

I'll clean up the ingest_home disk and move the rest to the new ZFS disk.
Please remove any data that is not needed @terrywbrady @elopatin-uc3
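
A sketch of the move itself, assuming the legacy EFS content is still mounted at /apps/ingest-stg-shared (an assumption based on the old queue path above; directory names come from the du listing):

$ rsync -aH --info=progress2 /apps/ingest-stg-shared/dataone/ /apps/ingest-stg-zfs/dataone/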

terrywbrady commented 2 years ago

@mreyescdl , I deleted the terry directory. I think @elopatin-uc3 will need to weigh in on the others.

mreyescdl commented 2 years ago

IAS request made to decommission the Stage EFS disk and to reduce I/O throughput on prod:

- Decommission EFS disk fs-6fe432c4.efs.us-west-2.amazonaws.com which is mounted
on Stage Ingest and Storage workers

- Reduce IO throughput on EFS production disk
fs-3b22fd91.efs.us-west-2.amazonaws.com from 50MB/s to 10MB/s
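
For reference, the equivalent AWS CLI calls would look roughly like this (IAS performs the actual change, and EFS mount targets must be removed before the file system can be deleted):

# reduce provisioned throughput on the prod EFS file system
$ aws efs update-file-system --file-system-id fs-3b22fd91 \
    --throughput-mode provisioned --provisioned-throughput-in-mibps 10
# decommission the stage EFS file system once unmounted everywhere
$ aws efs delete-file-system --file-system-id fs-6fe432c4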