FlowFuse / flowfuse

Connect, collect, transform, visualise, and interact with your Industrial Data in a single platform. Use FlowFuse to manage, scale and secure your Node-RED solutions.
https://flowfuse.com

Persistent File Storage in k8s environments #3056

Closed by knolleary 2 months ago

knolleary commented 11 months ago

Description

We introduced the File Nodes and the File Server as a workaround for the fact that our NR instances do not have a persistent file system. This allowed us to provide 'working' File Nodes that were familiar to existing NR users; however, they have some significant drawbacks.

Following a number of discussions on this topic, we want to revisit the original decision not to attach persistent storage volumes to our cloud-hosted Node-RED instances.

This is only scoped to the k8s driver in the first instance. Docker will require a different approach and LocalFS already has local file system access.

The goal will be for each instance to have a volume attached with the appropriate space quota applied.

Open questions:

User Value

Prior to this, interaction with cloud-based file storage was only possible using our own custom File Nodes. This change will allow any nodes (e.g. UI Builder, SQLite) to have file persistence when running on FlowFuse Cloud.

### Tasks
- [ ] https://github.com/FlowFuse/flowfuse/issues/3434
- [ ] https://github.com/FlowFuse/flowfuse/issues/4003
- [ ] Persistent Storage quota handling
- [ ] https://github.com/FlowFuse/flowfuse/issues/4055
- [ ] Update terraform scripts to support persistent storage (https://github.com/FlowFuse/terraform-aws-flowfuse/issues/10)
- [ ] [migration] Identify list of existing FFC users with stored files
- [ ] [migration] Identify process for migrating existing files to persistent storage
- [ ] Reach out to the owners of instances running on FF Cloud that will be impacted by this

Customer Requests

MarianRaphael commented 10 months ago

See also: https://github.com/FlowFuse/flowfuse/issues/1779

hardillb commented 4 months ago

@joepavitt should this now be on the dev board so it can be in the design stage?

joepavitt commented 4 months ago

Thanks for checking @hardillb

hardillb commented 4 months ago

Assumptions:

Questions:

Research required:

Mounting any volume on /data (the userDir) would mean that node_modules would persist across restarts. This would mean that installed nodes would be persistent, decreasing start-up time, but it would cause problems when stack changes happen, as these could change the Node.js version and require a rebuild of any native components.

We would also want the mount point to be the current working directory of the Node-RED process, so that any files created without a fully qualified path end up in the mounted volume. The core Node-RED File nodes have an option that can be set in settings.js to control this, but I don't think any 3rd party nodes honour that setting.
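For reference, a minimal sketch of that setting in settings.js, assuming the volume ends up mounted at /data/storage (an illustrative path, not a decided value):

```javascript
// settings.js (sketch) - the core File In/Out nodes resolve relative paths
// against fileWorkingDirectory; most third-party nodes ignore this setting
// and resolve against process.cwd() instead.
module.exports = {
    // ... existing Node-RED settings ...

    // Illustrative path - wherever the persistent volume ends up being mounted
    fileWorkingDirectory: "/data/storage"
}
```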

AWS Storage options:

- EBS https://aws.amazon.com/ebs/ - block based, needs a file system on top (but the K8s provisioner will format it on creation)
- EFS https://aws.amazon.com/efs/ - self-scaling, filesystem based
- FSx https://aws.amazon.com/fsx - not sure this would work for what we want; it looks more like a SAN in the cloud
- S3 https://aws.amazon.com/s3 - object storage; while it can look like a filesystem, I don't think this is what we want

Steve-Mcl commented 4 months ago

Would having a filesystem (automatically) permit virtual memory and thus improve the memory issues/crashes witnessed while installing?

> Mounting any volume on /data (the userDir) would mean that node_modules would persist across restarts. This would mean that installed nodes would be persistent increasing start up time

I thought having persistent FS would decrease start up time? (typo?)

hardillb commented 4 months ago

> Would having a filesystem (automatically) permit virtual memory and thus improve the memory issues/crashes witnessed while installing?

Not possible, you can't add swap space inside a container

> I thought having persistent FS would decrease start up time? (typo?)

Yes typo

Steve-Mcl commented 4 months ago

> Not possible, you can't add swap space inside a container

I remember seeing it was an alpha feature some time back.

seems it is now in beta: https://kubernetes.io/blog/2023/08/24/swap-linux-beta/

Totally happy to be told I am reading the wrong thing about an unrelated subject

hardillb commented 4 months ago

No, that is not useful; it applies to overall memory usage of the whole node, not on a per-pod basis (also off topic for this issue)

hardillb commented 4 months ago

Also need to decide on the quota implementation. It looks like the smallest increment we can mount is 1GB on AWS.

Need to know where this will sit in the Team/Enterprise levels and what happens on migrations between levels in FFC (given the work Nick has had to do for instance sizes being unavailable at higher levels)

ppawlowski commented 4 months ago

We should approach this topic from two perspectives - core app in general and FFC on EKS.

For the first one - the core app (or, more likely, the k8s driver) should use a dynamic storage provisioning approach: create a Persistent Volume Claim based on the provided storage class configuration and use it in the deployment definition. As a software provider, we cannot anticipate every possible storage class. As stated in the linked documentation, the cluster administrator is responsible for creating a storage class that meets the requirements. The name of that storage class should be passed to the application as a configuration parameter.
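As a rough sketch of the claim side of that approach, assuming the 0.x @kubernetes/client-node API; the claim name, storage class and size below are illustrative, not decided defaults:

```javascript
const k8s = require('@kubernetes/client-node')

const kc = new k8s.KubeConfig()
kc.loadFromDefault()
const core = kc.makeApiClient(k8s.CoreV1Api)

// Create a PVC for an instance against the administrator-supplied storage class.
// The claim is then referenced as a volume in the instance's Deployment definition.
async function createInstanceClaim (namespace, instanceId, storageClass, size = '1Gi') {
    const claim = {
        apiVersion: 'v1',
        kind: 'PersistentVolumeClaim',
        metadata: { name: `storage-${instanceId}` },
        spec: {
            accessModes: ['ReadWriteOnce'],
            storageClassName: storageClass,
            resources: { requests: { storage: size } }
        }
    }
    return core.createNamespacedPersistentVolumeClaim(namespace, claim)
}
```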

From the FFC perspective - once the above is implemented, we are limited to EBS and EFS. We should prefer EFS over EBS, for the following reasons:

With all the above in mind, EFS should be our first choice. However, EFS is backed by the NFS protocol and my main concern is its performance. Before making any production-ready decisions I suggest building a solid PoC first.

Although using AWS S3 as EKS storage is possible via a dedicated CSI driver, we should avoid it since it does not support dynamic provisioning.

References: https://zesty.co/blog/ebs-vs-efs-which-is-right/ https://www.justaftermidnight247.com/insights/ebs-efs-and-s3-when-to-use-awss-three-storage-solutions/

knolleary commented 4 months ago

Summary of discussion between @hardillb and myself:

  1. This option will only be available to AWS hosted instances using the k8s driver - e.g. FFC and FFDedicated
    1. Self-hosted k8s users will require design work on what storage services can be used. Out of scope for the first iteration.
  2. EFS is the first choice of backend for this.
  3. Need to clarify some of the limits EFS applies to ensure it's a solution we can scale (see below)
  4. Volume will be mounted as /data/storage and we'll update nr-launcher to use that as the working directory of the NR process (sketched below).
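To illustrate the last point, a minimal sketch of starting Node-RED with the mounted volume as its working directory - not the actual nr-launcher code, just the shape of the idea, using the paths proposed above:

```javascript
const { spawn } = require('child_process')

const USER_DIR = '/data'               // existing userDir
const PERSISTENT_DIR = '/data/storage' // mount point of the persistent volume

// Launching Node-RED with cwd set to the mounted volume means any file created
// with a relative path - by core or third-party nodes - lands on persistent storage.
const nodeRed = spawn('node-red', ['--userDir', USER_DIR], {
    cwd: PERSISTENT_DIR,
    env: process.env,
    stdio: 'inherit'
})

nodeRed.on('exit', (code) => process.exit(code ?? 0))
```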

Migration from existing file store

Exact details are TBD, but one option is for nr-launcher to copy files back from the file-store before Node-RED starts for the first time with the new storage option.
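One possible shape for that copy step, purely as a sketch - `fileStore.listFiles` and `fileStore.readFile` are hypothetical stand-ins for whatever client the existing file-server exposes:

```javascript
const fs = require('fs/promises')
const path = require('path')

// Hypothetical one-off migration step run by the launcher before the first
// start with persistent storage; a marker file stops it from running twice.
async function migrateFromFileStore (fileStore, targetDir = '/data/storage') {
    const marker = path.join(targetDir, '.file-store-migrated')
    try {
        await fs.access(marker)
        return // already migrated on a previous start
    } catch (_) { /* first start with the new storage option */ }

    for (const name of await fileStore.listFiles()) {           // hypothetical API
        const dest = path.join(targetDir, name)
        await fs.mkdir(path.dirname(dest), { recursive: true })
        await fs.writeFile(dest, await fileStore.readFile(name)) // hypothetical API
    }
    await fs.writeFile(marker, new Date().toISOString())
}
```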

We will identify the current list of instances actively using the file store and assess the scale of migration needed. It may be that we can apply something more manual at a small scale - although we need to consider self-hosted customers who choose to adopt this.

Availability

We already provide a storage quota per team type - but that is limited to our File Nodes and has seen limited uptake (will get exact numbers to back this assertion up)

We have two options:

Ultimately this will be a choice we can make further down the implementation as it will be a final stage configuration to apply to the platform.

Open questions

The following items need some additional research to ensure we have a scalable solution.

The EFS limits are documented as:

We provide each instance its storage via an access point on a volume, and each EFS volume can accommodate 120 access points. Combined with the per-account limit on the number of EFS volumes (1,000 by default), that gives capacity for around 120k instances. The volume limit is also one that can be increased on request. We'll need a way to manage the mapping of instance to volume to ensure good utilisation.

What is not currently clear is the mount-points-per-VPC limit: does it apply to the underlying nodes or to the pods (i.e. individual NR instances)? That is an order-of-magnitude difference - and if it's the latter, we're already beyond that limit. @hardillb is following up on this via the AWS support forums.

knolleary commented 4 months ago

Clarifications on the EFS limits:

ref: https://repost.aws/questions/QUOS-IQj4pSa2TZ2YouHe_AA/efs-limits-in-and-eks-environment-total-number-of-volumes-access-points-mount-points

hardillb commented 3 months ago

Looking at what will be needed for AWS EFS with Access Points, I think we will need 2 separate storage solutions.

hardillb commented 3 months ago

Maybe not

https://aws.amazon.com/blogs/containers/introducing-efs-csi-dynamic-provisioning/

https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/README.md
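If dynamic provisioning does cover it, a single StorageClass in `efs-ap` mode gives one access point per claim, so the generic PVC flow sketched earlier would work unchanged. A hedged sketch of what that class could look like - a cluster admin would normally apply this as YAML, but it is shown here via the 0.x @kubernetes/client-node API for consistency with the earlier sketch, and the class name and file system ID are placeholders:

```javascript
const k8s = require('@kubernetes/client-node')

// One StorageClass backed by a single EFS file system; in 'efs-ap' mode the CSI
// driver creates an access point per PersistentVolumeClaim (parameter names per
// the aws-efs-csi-driver docs linked above).
async function registerEfsStorageClass () {
    const kc = new k8s.KubeConfig()
    kc.loadFromDefault()
    const storageApi = kc.makeApiClient(k8s.StorageV1Api)

    return storageApi.createStorageClass({
        apiVersion: 'storage.k8s.io/v1',
        kind: 'StorageClass',
        metadata: { name: 'efs-instances' },       // placeholder name
        provisioner: 'efs.csi.aws.com',
        parameters: {
            provisioningMode: 'efs-ap',
            fileSystemId: 'fs-0123456789abcdef0',  // placeholder file system ID
            directoryPerms: '700'
        }
    })
}
```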

joepavitt commented 3 months ago

@hardillb are you able to provide a rough delivery date for this please?

hardillb commented 3 months ago

Assuming testing today goes well, the technical parts are pretty much done, with the exception of how to enforce the quota requirements.

At this time I have no idea how long that will take.

joepavitt commented 3 months ago

Any updates please @hardillb - the release is next week, and marketing is asking whether this highlight will be delivered

hardillb commented 3 months ago

The code changes are up for review

We need to:

joepavitt commented 3 months ago

Okay, and when will we answer those questions, and who is responsible for answering/actioning them?

hardillb commented 3 months ago

I'll get with @ppawlowski tomorrow to install the EFS driver so it's ready

The question on access was asked higher up and then left. It's a product call whether we make this available only to higher tiers, but the old file storage is currently available to all tiers, just with different quota sizes.

The fact we don't have a quota solution for this at the moment may impact the last point.

joepavitt commented 3 months ago

@hardillb status update please - ready to go for tomorrow?

hardillb commented 3 months ago

@joepavitt should be, I need the following reviewed/merged:

I'm finishing off the last of the environment prep at the moment.

joepavitt commented 3 months ago

@hardillb assuming we can close this out now?

knolleary commented 3 months ago

The core feature has been delivered - so yes, I think we can close this off.

There are some residual tasks to complete which we should get raised separately.