ITISFoundation / osparc-issues

🐼 issue-only repo for the osparc project

Filesystem concept (Feb.) #1313

Open SCA-ZMT opened 5 months ago

SCA-ZMT commented 5 months ago

NIH milestone

This is also a milestone for NIH, due by Y8Q2 (Feb. 2025)

Large Files

Improve handling of large files and reduce file-transfer operations as much as possible. This requires some research. The solution should ideally also work on-premise. Storage might need to be completely overhauled.

Data Accessibility [This is a Milestone for NIH Year 8 Q2]

Provide infrastructure to users to inspect/download/zip/delete files. This is a Milestone in NIH Year 8 Q2 (see https://github.com/ITISFoundation/osparc-issues/issues/1635)

Shared Folder(s) - Advanced

Provide users the possibility to mount arbitrary data into their services:

### Tasks
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1442
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5833
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5872
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1227
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6245
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6246
### Eisbock
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1629
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1619
- [ ] https://github.com/ITISFoundation/osparc-issues/issues/1630
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6243
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/6244
SCA-ZMT commented 5 months ago

@mguidon could you please edit the issue description with more info / tasks?

matusdrobuliak66 commented 3 months ago

A first brainstorming session took place with @sanderegg, @mguidon, @matusdrobuliak66. We would like to experiment with AWS Elastic File System (EFS) in three stages:

  1. Caching images
  2. Caching workspace
  3. Reusing the same workspace across multiple EC2 instances

We may investigate the AWS DataSync option, but for now we prefer to stick with RClone.

Why?

GitHK commented 3 months ago

@matusdrobuliak66 @sanderegg @mguidon why was rclone discarded? Has anybody thought about `rclone mount`, which uses FUSE, streams data on access, and uses a cache for writing back to S3?

This will allow us to start services without waiting. Also saving will go to S3 directly. So once the user is done interacting with the FS, it's as if the file was already on S3.
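
If this direction is picked up again, a minimal sketch of such a mount could look like the following (the remote name `s3remote`, bucket `osparc-workspace`, and mount path are placeholders):

```bash
# stream files from S3 on access; buffer writes in a local cache and upload them back asynchronously
rclone mount s3remote:osparc-workspace /mnt/workspace \
  --vfs-cache-mode writes \
  --vfs-cache-max-size 10G \
  --daemon
```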

sanderegg commented 3 months ago

@GitHK rclone is not discarded; we go stepwise. And until fixed, RClone has the bad habit of blowing up without anyone knowing why.

mguidon commented 3 months ago

@GitHK Indeed, we did not discard it, we go step by step. For now, rclone sync should do exactly what we want without the need for FUSE.
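
For reference, a plain one-way sync without any FUSE mount would be along these lines (bucket and paths are placeholders):

```bash
# copy the local workspace to S3, transferring only changed files
rclone sync /home/user/workspace s3remote:osparc-workspace/project-123 --progress
```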

GitHK commented 3 months ago

After some issues with rclone, it will be phased out in favour of aws s3 sync. If we require fuse, we could use s3fs-fuse which is also managed by AWS.
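
A minimal sketch of the `aws s3 sync` equivalent (bucket and paths are placeholders):

```bash
# push the local workspace to S3 and delete remote files that no longer exist locally
aws s3 sync /home/user/workspace s3://osparc-workspace/project-123 --delete
```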

matusdrobuliak66 commented 3 months ago

Investigation

Caching images

We want to have a quick startup time and get rid of buffer machines.

1. EFS (docker image save/load) - see the sketch after this list

2. EFS (moving Image data to EFS via symbolic link)

3. EBS snapshot/volume (pre-baked AMI)

4. Multi-attach EBS
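
As an illustration of option 1, images could be dumped to and restored from a shared EFS mount roughly as follows (paths and image names are hypothetical):

```bash
# dump a pulled image to the shared EFS mount (done once, e.g. on a warm machine)
docker image save -o /mnt/efs/image-cache/simcore-service.tar my-registry/simcore-service:1.2.3

# restore it on a freshly started EC2 instance instead of pulling from the registry
docker image load -i /mnt/efs/image-cache/simcore-service.tar
```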

Caching workspace

1. s3fs - see the mount sketch after this list

2. Mountpoint S3

3. EFS
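
For options 1 and 2, the mounts would look roughly like this (bucket name, mount path, and cache directory are placeholders):

```bash
# option 1: s3fs-fuse with a local disk cache, credentials taken from the instance IAM role
s3fs osparc-workspace /mnt/workspace -o iam_role=auto -o use_cache=/tmp/s3fs-cache

# option 2: Mountpoint for Amazon S3
mount-s3 osparc-workspace /mnt/workspace
```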

General S3 comment

As mentioned by @sanderegg, if we correctly set up an S3 endpoint in the same region where the EC2 instances and the S3 bucket are located, we can have zero S3 data-transfer costs!
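
For reference, the zero-transfer-cost setup corresponds to an S3 gateway VPC endpoint in the same region as the EC2 instances; a sketch with placeholder IDs:

```bash
# route EC2 <-> S3 traffic through a gateway endpoint inside the VPC (no data-transfer charges)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```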

Conclusion/Recommendation

sanderegg commented 3 months ago

One of these links I already shared before.

matusdrobuliak66 commented 3 months ago

Investigation (Part 2)

Caching Images

Caching Workspace

Two working examples (mounting EFS to docker container):

```yaml
volumes:
  wp_gary_gitton:
    driver_opts:
      type: nfs
      o: addr=fs-.efs.us-east-1.amazonaws.com,rw,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
      device: :/docker_compose_test
```


It seems there is no straightforward option to limit the size of a docker volume when mounting. EFS doesn't support quotas. In any case, I think it is not necessary: we can use CloudWatch and, for example, Lambda to monitor and manage EFS usage programmatically (see the sketch after this list).
- Script to Monitor Directory Sizes: A script runs periodically to check the size of each project's directory and sends this data to CloudWatch.
- CloudWatch Alarms: Alarms are set up to monitor the directory sizes and trigger actions when limits are exceeded.
- Lambda Function: A Lambda function handles the alarms and takes appropriate actions, such as enforcing policies.
  - For example: using EFS Access Points with IAM policies, we can restrict access based on some policy.
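
A minimal sketch of such a monitoring script, run periodically e.g. via cron (namespace, metric name, and EFS paths are hypothetical):

```bash
#!/usr/bin/env bash
# report the size of each project directory on the EFS mount as a custom CloudWatch metric
for dir in /mnt/efs/projects/*/; do
  project=$(basename "$dir")
  size_bytes=$(du -sb "$dir" | cut -f1)
  aws cloudwatch put-metric-data \
    --namespace "EFS/Projects" \
    --metric-name DirectorySizeBytes \
    --dimensions Project="$project" \
    --value "$size_bytes" \
    --unit Bytes
done
```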
sanderegg commented 3 months ago

@matusdrobuliak66 about the differences between shutdown and termination, please look here: AWS reference (this link was also already referenced before). Now, regarding keeping the EBS volume "active": since with the start/stop mechanism the actual "machine" changes, are you sure that the access times when you restart the machine are on par with a hot machine? There are simple tests (like the AWS pre-warm test) to check that; if reading the blocks takes a long time, the volume is there but 10-100 times slower. Please check it. Keeping volumes around is costly.
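
For the pre-warm check, AWS documents initializing a volume restored from a snapshot by reading every block once; the device name below is an example:

```bash
# force-read all blocks; if this takes very long, the restored volume is not yet "hot"
sudo fio --filename=/dev/xvdf --rw=read --bs=1M --iodepth=32 \
  --ioengine=libaio --direct=1 --name=volume-initialize
```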

Another question, how do you setup the EBS volumes without EC2s?

matusdrobuliak66 commented 3 months ago

Conclusion/Summary

mguidon commented 3 months ago

Thanks @matusdrobuliak66. This looks like a plan. Let's keep rclone on the radar for mounting other filesystems (e.g. sftp and 3rd-party providers). I suggest going ahead immediately with 1, 2 and 4 and continuing to investigate 3.

matusdrobuliak66 commented 3 months ago

EFS Experimentation

testing:

Outputs:

EFS (Bursting Mode)

EFS (Elastic Mode)

Note:

matusdrobuliak66 commented 3 months ago

Mounting EFS to the Simcore node: https://git.speag.com/oSparc/osparc-infra/-/merge_requests/230#7af1093b9e6a1427b21a22270b4df48cc5311648

matusdrobuliak66 commented 2 months ago

This is how you can enable it for testing purposes: https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/639