Create an initcontainer for persistent home setup

dkwon17 commented 5 months ago

What does this PR do?

The problem

Today, the persistent home setup occurs in the entrypoint [1] of the tooling container (assuming it is based on the UBI). Notably, the entrypoint runs stow [2] to make symbolic links for the home directory to be saved on the PVC.

If a poststart event depends on files or edits files located in the home directory, there is a race condition because the poststart event may start running before stow has finished running. Stow takes about 6 seconds to complete.

This race condition causes https://github.com/eclipse-che/che/issues/22914, because the checode poststart event runs before the entrypoint script completes.

Proposed solution

To solve the race condition altogether, the entrypoint script must be complete before the post start events run. To do this, an initcontainer that runs the entrypoint script can be used.

What is the container image for the initcontainer?

To have a minimal effect on workspace startup times, the container image for the initcontainer should be the same image as the tooling container's image to avoid pulling an extra image. Assuming that the image is already pulled, I've observed no significant startup impact (see demo video in the next section) when running an extra container running the tooling image, even if the tooling image is large (quay.io/devfile/universal-developer-image:ubi8-latest has a compressed size of 2.9GB).

When thinking about it in the Docker perspective, running a container is only creating a writable layer on top of the already-pulled image layers, therefore if the image is already pulled the image size would not have a significant impact on container startup. ~~I'm guessing~~ this is what also happens in OpenShift/CRI-O, because adding the initcontainer did not increase workspace startup time significantly.

What issues does this PR fix or reference?

https://github.com/eclipse-che/che/issues/22914 Fixes https://github.com/devfile/devworkspace-operator/issues/1260

Demo: https://youtu.be/RQeKKJgaogM

In the demo above, the init container named init-persistent-home runs for than 9 seconds at the 0:20 second mark. Stow is only run once (because of this if statement [3]) to set up the new PVC.

So for subsequent workspace starts, the init-persistent-home init container takes less than a second to finish. Here is a second demo showing that:

https://youtu.be/hKnb1mgBSqA?si=yuJ3EKmj_dX0qHj0&t=20

Is it tested? How?

To test this PR with Eclipse Che:

Install DWO on your cluster by adding this catalog source and installing DWO from OperatorHub:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: devworkspace-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/dkwon17/devworkspace-operator-index:init-container
  publisher: Red Hat
  displayName: DevWorkspace Operator Catalog
  updateStrategy:
    registryPoll:
      interval: 5m

After waiting a few minutes, the DWO from the new catalog should appear in OperatorHub. Install the one from DevWorkspace Operator Catalog.

Install Eclipse Che with chectl:

chectl server:deploy -p openshift --skip-devworkspace-operator

Enabled persistent user home by adding the following in the CheCluster CR:

spec:
  devEnvironments:
    persistUserHome:
      enabled: true

Create a workspace with the following repositories and confirm that the podman info command runs without errors after workspace startup (just like in the demo video):

PR Checklist

[ ] E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
- [ ] v8-devworkspace-operator-e2e: DevWorkspace e2e test
- [ ] v8-che-happy-path: Happy path for verification integration with Che

Is it tested? How?

PR Checklist

[ ] E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
- [ ] v8-devworkspace-operator-e2e: DevWorkspace e2e test
- [ ] v8-che-happy-path: Happy path for verification integration with Che

openshift-ci[bot] commented 5 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

dkwon17 commented 5 months ago

Some cases to think about in general:

User is using a parent devfile (works, tested with this child devfile)
User's main container image does not have the /entrypoint.sh script
User's main container image has ["tail", "-f", "/dev/null"] in it's entry point, which would cause the init container to never terminate

AObuchow commented 5 months ago

Amazing work so far David. I think this is a good proposed solution to fixing the race condition between the UDI's entrypoint and any post-start commands.

There are a lot of considerations to make so my thoughts are a bit all over the place, but here goes:

Should we create the home persistence init container component in DWO or Che Dashboard?

This is something you mentioned in your earlier draft. I think it makes the most sense for the init container component to be added in DWO:

We want to minimize the transformation of a devfile -> devworkspace, so that DWO's devfile bootstrap feature continues to work. If the Che-Dashboard was adding an extra component to the devworkspace for home persistence, this extra component would be lost when DWO restarts the devworkspace from the cloned devfile (as the bootstrap feature will use components from the devfile, not the devworkspace)
I'm not sure if the Che-Dashboard can currently "read" the Che Cluster CR in order to determine the home persistence settings? The dashboard would have to watch for changes in the Che Cluster CR or get re-deployed (with the configured home persistence settings) when the home persistence configuration changes.
There's already some level of coupling between the UDI and DWO for home persistence: mounting /home/user/ as a volume assumes that $HOME == /home/user/, which is the case for the UDI but not necessarily true for other user-provided images.
- Since there's already coupling between DWO and the UDI, we should keep this the same rather than move the coupling to be between the UDI and Dashboard.

Should we call the main/init containers entrypoint? Or just the code relevant to home persistence (stow, cp)?

I don't think we want to call the UDI's entrypoint directly, as there's other code in it that is irrelevant to home persistence. Instead, we'll want to move the stow-related code into its own script and execute that.

This script could potentially reside in the UDI under a pre-determined name (e.g./home-persistence-init.sh), or it could be injected by DWO as the initContainer's command. In the latter case, we could test whether stow is installed in the container before calling it.

Should we hope that the main/tooling component has stow installed? Or consider shipping a dedicated init container with stow installed.

Instead of depending on the first container component in the devworkspace being the UDI & having stow installed, we could ship a minimal image that has stow installed (similar to how we ship the project clone container).

This would simplify the inferInitContainer() step, but require us to productize a new container downstream that we'd be shipping.

For a first pass of your proposed change, I think it's fine to require users to use a main/tooling image based on the UDI (with stow installed, as well as tooling installed in /home/tooling/ and $HOME set to /home/user/). This is already the case for $HOME persistence to work as expected, anyhow.

EDIT: This idea I had of having a dedicated init container image that provides the stow binary and executes it doesn't make sense, as there is no way to mount the/home/tooling/ directory from the tooling/main container to the init container image. Ignore this point

AObuchow commented 5 months ago

Some cases to think about in general:

* User's main container image does not have the /entrypoint.sh script

* User's main container image has `["tail", "-f", "/dev/null"]` in it's entry point, which would cause the init container to never terminate

Both of these can be resolved by having DWO call stow and cp directly (by default), rather than requiring them to be in the entrypoint. We could then let users configure the init container command through the DWOC if they don't want to use the default we provide.

Edit: one thing we'll have to consider if we move the stow step outside of the entrypoint, the stow step will now be executed before the kubedock step, which might break kubedock (stow is currently creating a symlink from /home/tooling/.local/bin/podman -> /home/user/.local/bin/podman). I'm hopeful this won't actually be a problem though, as /home/tooling/.local/bin/ should be on $PATH.

dkwon17 commented 5 months ago

Sorry, the PR still had rough work, I've still updated the PR recently with some of your feedback

AObuchow commented 5 months ago

@dkwon17 since we're adding new fields to the DWOC, you'll have to modify mergeConfig() and GetCurrentConfigString()in sync.go to account for the new fields.

Additionally, the CRDs will have to be re-generated by running the following and adding the resulting changes into a separate commit: make generate manifests fmt generate_default_deployment generate_olm_bundle_yaml.

Since we're still working through the details of this PR, these 2 steps can wait until we're set on the new DWOC fields.

tolusha commented 5 months ago

@dkwon17 ~~I've noticed, that after enabling persistUserHome, all content from /home/user is vanished that it has been there before~~

dkwon17 commented 5 months ago

@tolusha could you provide a reproducer? Does this happen when enabling persistUserHome and starting an existing workspace?

openshift-ci[bot] commented 5 months ago

@tolusha: changing LGTM is restricted to collaborators

In response to [this](https://github.com/devfile/devworkspace-operator/pull/1251#pullrequestreview-2035873766): > Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

AObuchow commented 4 months ago

@dkwon17 your latest changes look good to me :) I still need to do some final manual testing on a cluster but I think this is just about ready to merge.

My 2 last thoughts on this PR:

We need to verify this doesn't break the Podman wrapper in the UDI (i.e. kubedock) since now Stow is being invoked before the podman wrapper is symlinked. However, as we briefly discussed on call, I don't think this is an issue since /home/tooling/ (where the Podman wrapper exists) is on PATH
It'd be nice to squash some of the commits from this PR together to create an easier-to-follow git log. My suggestion is that you could probably just have ~3 commits: one for the feature implementation, one for changing the CRDs/resources and one for adding tests. I know some commits (like this one) change both the implementation and test cases, so don't worry if these remain their own separate commits.

dkwon17 commented 4 months ago

Thanks for the feedback, I was able to verify the kubedock functionality [1] in the UDI for this PR

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_dev_spaces/3.9/html/release_notes_and_known_issues/new-features#enhancement-crw-3367

AObuchow commented 4 months ago

Thanks for the feedback, I was able to verify the kubedock functionality [1] in the UDI for this PR

[1] access.redhat.com/documentation/en-us/red_hat_openshift_dev_spaces/3.9/html/release_notes_and_known_issues/new-features#enhancement-crw-3367

@dkwon17 awesome, thank you for verifying this. I'll do some final testing and have this merged.

dkwon17 commented 4 months ago

Sounds good @AObuchow , do you want me to squash the commits and force push?

AObuchow commented 4 months ago

@dkwon17 yes please squash your commits then force push if you get a chance. I know you're going to be on PTO however, so I can do this for you.

AObuchow commented 4 months ago

@dkwon17 thanks for squashing your commits. Will do final review & have this merged by Wednesday morning :)

AObuchow commented 4 months ago

Note to myself: need to create a DWO issue before this is merged for tracking in the 0.28 release miletstones

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AObuchow, dkwon17, ibuziuk, svor, tolusha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/devfile/devworkspace-operator/blob/main/OWNERS)~~ [AObuchow,dkwon17] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

devfile / devworkspace-operator