eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Asynchronously attach PVs to Workspaces #15384

Closed l0rd closed 2 years ago

l0rd commented 4 years ago

Is your enhancement related to a problem?

No matter how fast we get at bootstrapping a Che workspace, and no matter how many external resources we are able to pre-pull (images, extensions, example source code), we will always need to wait 20+ seconds for a PV to be attached and mounted on the workspace pod.

(screenshot: workspace startup timings, showing the time spent attaching and mounting the PV)

Describe the solution you'd like

(diagram: proposed architecture with a data sync pod owning the PV and workspace pods running on ephemeral volumes)

New Workspace lifecycle (a manifest sketch of the two pods follows the list):

  1. Workspace startup phase
    • Pods (workspace and data sync) are started in parallel
  2. Startup data sync phase:
    • Data flow goes from the persistent volume (rsync server) to the ephemeral volume (rsync client)
    • Containers in the Workspace Pod are started but are not allowed to write in the ephemeral volume
  3. Normal workspace usage phase
    • Containers in the Workspace Pod have full R/W access to the ephemeral volume
    • Data flow goes from rsync client to rsync server
  4. Workspace Shutdown phase
    • Containers in the Workspace Pod are stopped and data are flushed to the ephemeral volume
  5. Shutdown data sync phase
    • Data are transferred from the ephemeral volume to the persistent volume
    • Pods are destroyed
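
A minimal sketch of what the two pods described above could look like. All names, images, hostnames, and paths here are illustrative assumptions (it also assumes a Service exposing the data sync pod as `data-sync`), not the actual implementation:

```yaml
# Data sync pod: the only pod that mounts the PV, exposed over rsync
apiVersion: v1
kind: Pod
metadata:
  name: data-sync
  labels:
    app: data-sync
spec:
  containers:
    - name: rsync-server
      image: example/rsync-server:latest      # hypothetical image
      ports:
        - containerPort: 873                  # default rsyncd port
      volumeMounts:
        - name: workspace-data
          mountPath: /data
  volumes:
    - name: workspace-data
      persistentVolumeClaim:
        claimName: workspace-data             # the slow PV attach/mount happens here only
---
# Workspace pod: starts in parallel on an ephemeral volume, no PV attach to wait for
apiVersion: v1
kind: Pod
metadata:
  name: workspace
spec:
  containers:
    - name: sync-client                       # covers phases 2, 3 and 5 of the lifecycle
      image: example/rsync-client:latest      # hypothetical image
      command:
        - sh
        - -c
        - |
          rsync -az rsync://data-sync:873/data/ /projects/   # startup sync: PV -> ephemeral
          touch /projects/.sync-complete                     # signal that read-only mode can end
          sleep infinity                                     # later: periodic sync back to the PV
      volumeMounts:
        - name: projects
          mountPath: /projects
    - name: theia-ide
      image: example/theia:latest             # hypothetical image
      volumeMounts:
        - name: projects
          mountPath: /projects
  volumes:
    - name: projects
      emptyDir: {}                            # ephemeral volume, available immediately
```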

Workspace components in Read-only mode

In the "Startup data sync phase" the user will already be able to use the editor and plugins but those should behave in a read-only mode until all the data has been synced to the ephemeral volume. That means that Che editors (for example theia) should be able to work on read only mode (initially this can be done by showing a progress bar that shows the data sync and not allowing the user to access theia).

rsync protocol

Rsync is mentioned as the remote file synchronization protocol, but that's just an example. If there is a better alternative, let's use it.
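
For reference, a hedged sketch of the client-to-server direction (phases 3 and 5 above), expressed as a replacement for the `sleep infinity` placeholder in the sync-client container sketched earlier; flags, interval, and paths are illustrative:

```yaml
      command:
        - sh
        - -c
        - |
          rsync -az rsync://data-sync:873/data/ /projects/    # startup sync: PV -> ephemeral
          touch /projects/.sync-complete
          while true; do                                      # normal usage: ephemeral -> PV
            rsync -az --delete /projects/ rsync://data-sync:873/data/
            sleep 30
          done
```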

Ideas to improve performance (even more)

Florent's edit:

Tasks

amisevsk commented 4 years ago

This would also help solve issues with e.g. Gluster being too slow for some operations (npm install).

l0rd commented 4 years ago

Another option mentioned by @gorkem is to leverage ephemeral containers, introduced in Kubernetes 1.16. That would allow us to avoid using rsync.
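
For context, a rough sketch of what the alpha API (Kubernetes 1.16) looks like; ephemeral containers are added through the pod's `ephemeralcontainers` subresource, and whether they can actually replace the sync mechanism is exactly the open question here. Names and images are hypothetical:

```yaml
# EphemeralContainers object applied to an existing workspace pod (alpha in 1.16)
apiVersion: v1
kind: EphemeralContainers
metadata:
  name: workspace                        # must match the target pod name
ephemeralContainers:
  - name: sync-helper                    # hypothetical helper container
    image: example/rsync-client:latest   # hypothetical image
    command: ["sh"]
    stdin: true
    tty: true
```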

benoitf commented 4 years ago

Hello,

here are some notes:

about data sync pod:

optimization

One of the goals is to be able to start the workspace as fast as possible. The simplest case for that is when we create a new workspace (no previous state):

If there is previous data, the IDE needs to wait for the project to be restored before displaying the full layout.

Storage synchronization:

optimization: it could clean up the 'unpacked' folder and keep only the zip files when the files have not been used for a long time.

Theia enhancements:

Another optimization: for now, the import/clone of the source code is performed once we enter the IDE (which is useful when a 'private' repository is accessed, as we may need the GitHub token, OAuth, etc.). But in the case of a public repository, if the project is cloned as soon as possible, we could enter Theia with the project already cloned, or have it cloned in parallel. So it might speed up the process again. --> needs another Epic just for this specific item.
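
A hedged sketch of that idea for public repositories: clone in an init container (or in parallel from a sidecar) so the sources already sit in the ephemeral volume when Theia opens. The repository URL, image, and paths are placeholders:

```yaml
  # Added to the workspace pod sketch: clone a public project before the IDE starts
  initContainers:
    - name: clone-project
      image: alpine/git:latest            # hypothetical image choice; entrypoint is `git`
      args: ["clone", "--depth=1", "https://github.com/eclipse/che.git", "/projects/che"]
      volumeMounts:
        - name: projects
          mountPath: /projects
```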

tsmaeder commented 4 years ago

Just a couple of notes:

  1. From the kube docs:

    Warning: Ephemeral containers are in early alpha state and are not suitable for production clusters.

  2. My .m2 folder is 750MB. I suspect we might have to plan for large amounts of data.
  3. In order to prevent data loss, we'll have to rsync while developing. Have we measured the impact this has on the performance of development tools (running yarn on Theia, for example)?

benoitf commented 4 years ago

about 1. we read the docs as well but thx :-)

  1. yes, it is easy to unpack/transfer big chunks, while transferring a lot of small chunks is slow (but that's an obvious fact). BTW some docker images are bigger than 750MB (even compressed).
  2. you may use different sync processes. A dumb one would rsync all files with the same priority (user files and generated files), while a smarter one would give higher priority to files the user really modified (source code) vs the 'node_modules' folder, where even data loss is less problematic. We could also skip all node_modules folders during 'workspace development' and only persist them when closing the workspace (the user may accept to lose that data, but like every ignore rule it's a tradeoff); see the sketch below.
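
A hedged sketch of that kind of prioritized sync, expressed as the push side of the sync-client container from the earlier sketch; the exclude patterns and the shutdown-only pass are illustrative:

```yaml
      # During "workspace development": push user files, skip heavy generated folders
      command:
        - sh
        - -c
        - |
          rsync -az --exclude 'node_modules/' --exclude 'target/' \
                /projects/ rsync://data-sync:873/data/
          # On workspace shutdown only: persist everything, generated folders included
          # rsync -az /projects/ rsync://data-sync:873/data/
```
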
tsmaeder commented 4 years ago

about 1. we read the docs as well but thx :-)

No doubt, but the fact that we're relying on unproven technology might merit a bit of discussion, no? Is this tech ready for our customers?

benoitf commented 4 years ago

@tsmaeder we're not considering it for now, as the solution should work on all OpenShift/Kubernetes instances

gazarenkov commented 4 years ago

Do we really need to have per-workspace PV attachment?

Couldn't we have a single Data Sync service/deployment used by all the workspaces instead?

benoitf commented 4 years ago

@gazarenkov it's per user namespace (all workspaces of a user) first.

gazarenkov commented 4 years ago

@benoitf Ok, thanks, then do we really need it per-namespace? :)

benoitf commented 4 years ago

@gazarenkov at first because, for example, on che.openshift.io you won't be able to mount a single "super big" PV to store all workspaces' data (and then how do you manage per-user quotas as we do today, cross-cluster setups, etc.). By using a per-namespace PV at first (while still considering one service for all users later), it stays within the same K8S architecture.
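
For reference, a hedged sketch of how per-user storage quotas stay manageable with per-namespace PVCs; the namespace name and numbers are placeholders:

```yaml
# Per-user storage quota, applied in each user's namespace (illustrative values)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: workspace-storage
  namespace: user-a-che                 # hypothetical per-user namespace
spec:
  hard:
    persistentvolumeclaims: "1"         # one data sync PVC per user
    requests.storage: 10Gi              # total PV storage the user may claim
```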

gazarenkov commented 4 years ago

@benoitf What are the limitations for mounting a single big PV in che.openshift.io?

l0rd commented 4 years ago

@benoitf What are the limitations for mounting a single big PV in che.openshift.io?

@gazarenkov that's trickier because the service that does the sync needs to deal with files of different users. That can be implemented as a second iteration though. But let's keep this first iteration simple and implement a PV per user.

gazarenkov commented 4 years ago

@l0rd Looks like we are on the same page regarding direction. If so, I'd suggest reconsidering the strategy and evaluating a move to a single service at once, because:

So, I'd definitely suggest evaluating a single data service as an option before we go to implementation.

davidfestal commented 4 years ago

Just some thoughts about this issue, the ongoing work on the Workspace CRD, and cloud shell.

According to this EPIC https://github.com/eclipse/che/issues/15425, there will be, at some point, the ability to start Che 7 workspaces in a lightweight, standalone, and embeddable way, without requiring the presence of the Che master (already demoed as a POC).

One important point mentioned in this EPIC is the big scalability gain that would be brought, in this envisioned K8S-native architecture, by:

In the light of this, I would prefer starting this work with the option that is, as much as possible, compatible with both use-cases:

So it seems to plead for a per-user-namespace solution first. Of course this should not prevent us from extending this solution to use a central service in a second step. But requiring an additional central server to be able to start workspaces seems contrary to the architectural direction we've taken with the DevWorkspace CRD and the cloud-shell.

gazarenkov commented 4 years ago

@davidfestal Could you please elaborate on your vision of the layer that persists project code between (i.e. temporary) user sessions, in light of workspace management decentralization? I.e. if in our next system we replace the Che server with a CRD/controller and Postgres with etcd, what does that mean for the physical storage of projects? How exactly is it related? We are going to replace the single (distributed) filesystem (based on Gluster/Ceph/EBS/something else) with what?

davidfestal commented 4 years ago

@gazarenkov

Physical storage for workspace data is already per-user (if not per-workspace), through namespaced PVs, and not centralized and common to all the users. I don't see what should change here with the Workspace CRD architecture: workspace data storage is already decentralized, and I don't see why it would be required to change the existing way and now store workspace data in a PV common to all users.

But even without going into all technical details here, my point was to say that requiring an additional centralized service in an architecture that finally should be compatible with workspace management decentralization, seems strange to me.

Afaict, the initial proposal from @benoitf with per-user-namespace storage, would fit the existing and future structure of the Workspace CRD POC.

But sure, a centralized workspace storage service could, at some point, be an optimization option for some use-cases.

gorkem commented 4 years ago

Wouldn't a single big PV require ReadWriteMany access mode?

gazarenkov commented 4 years ago

@gorkem I would guess RWO will work fine for a single Data Store Pod; if a second (or more) pod spins up, it depends on whether the scheduler puts it on the same node (should work) or a different one (will not). https://github.com/kubernetes/kubernetes/issues/26567
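
For reference, a hedged sketch of the access modes being discussed; claim names and sizes are placeholders. RWO is enough as long as a single Data Store pod (or pods co-scheduled on one node) mounts the claim, while spreading pods across nodes would need RWX, which not all provisioners support:

```yaml
# Sufficient for a single data-store pod, or several pods on the same node
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-data
spec:
  accessModes: ["ReadWriteOnce"]     # RWO: mounted by one node at a time
  resources:
    requests:
      storage: 10Gi
---
# Needed if data-store pods may land on different nodes
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-data-shared
spec:
  accessModes: ["ReadWriteMany"]     # RWX: mounted by multiple nodes simultaneously
  resources:
    requests:
      storage: 10Gi
```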

l0rd commented 4 years ago

@gazarenkov why do you think one central service is simpler? In a centralized service we have to build a secured-to-the-bone mechanism that matches users with folders. And we need to consider scalability as well. A problem with that service and users won't be able to access their data, or even worse, will have access to the data of other users. I don't want to deal with those problems right now.

For the reuse of existing code that's an implementation detail. I would let the team that will work on the code to decide.

gazarenkov commented 4 years ago

@l0rd my guess is that a single service may turn out to be simpler, based on the fact that we have experience with a working system that used this approach. The only potential problem is if we run into PV/K8s infra-specific limitations that will not allow us to use it (such as PV size, access mode, etc.).

I do not think the user should have direct access to this data (which is a hot backup of projects), only via the Data Sync service, which presumably can scale its Pods the same way as a usual K8s Deployment?

I think it may even work without this service, exactly the same as it does with ephemeral storage now, i.e. the user has access to the instance storage only, and syncing this data is an exclusively internal mechanism. That's why I do not think this storage should even know who the owner of a particular workspace is; it may just deal with folders identified by workspaceId.

An additional bonus of this approach may be zero PV attach/mount time (like ephemeral storage again).

So, to me, it looks like an option to consider before coding, no?

gorkem commented 4 years ago

About the Central Service

Some other considerations

l0rd commented 3 years ago

@ibuziuk was there something left here, or can we close the epic?

che-bot commented 2 years ago

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.