dkwon17 closed this pull request 4 months ago
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Amazing work so far David. I think this is a good proposed solution to fixing the race condition between the UDI's entrypoint and any post-start commands.
There are a lot of considerations to make so my thoughts are a bit all over the place, but here goes:
This is something you mentioned in your earlier draft. I think it makes the most sense for the init container component to be added in DWO: mounting `/home/user/` as a volume assumes that `$HOME == /home/user/`, which is the case for the UDI but not necessarily true for other user-provided images.
I don't think we want to call the UDI's entrypoint directly, as there's other code in it that is irrelevant to home persistence. Instead, we'll want to move the stow-related code into its own script and execute that.
This script could potentially reside in the UDI under a pre-determined name (e.g. `/home-persistence-init.sh`), or it could be injected by DWO as the initContainer's command. In the latter case, we could test whether stow is installed in the container before calling it.
Instead of depending on the first container component in the devworkspace being the UDI & having stow installed, we could ship a minimal image that has stow installed (similar to how we ship the project clone container).
This would simplify the `inferInitContainer()` step, but require us to productize a new container downstream that we'd be shipping.
For a first pass of your proposed change, I think it's fine to require users to use a main/tooling image based on the UDI (with stow installed, as well as tooling installed in /home/tooling/ and $HOME set to /home/user/). This is already the case for $HOME persistence to work as expected, anyhow.
EDIT: This idea I had of having a dedicated init container image that provides the stow binary and executes it doesn't make sense, as there is no way to mount the `/home/tooling/` directory from the tooling/main container to the init container image. Ignore this point.
Some cases to think about in general:

* User's main container image does not have the `/entrypoint.sh` script
* User's main container image has `["tail", "-f", "/dev/null"]` in its entrypoint, which would cause the init container to never terminate
Both of these can be resolved by having DWO call stow and cp directly (by default), rather than requiring them to be in the entrypoint. We could then let users configure the init container command through the DWOC if they don't want to use the default we provide.
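As a rough sketch of what such a default could look like (all paths, the marker-file name, and the fallback logic here are illustrative assumptions, not DWO's actual defaults), the init container could prefer stow when it's available and fall back to a plain copy otherwise. The throwaway directories below stand in for `/home/tooling/` and `/home/user/`:

```shell
# Demo layout standing in for /home/tooling (tooling files) and
# /home/user (the persisted home volume); real paths are assumptions.
WORK_DIR=$(mktemp -d)
TOOLING_DIR="$WORK_DIR/tooling"
HOME_DIR="$WORK_DIR/user"
mkdir -p "$TOOLING_DIR/.local/bin" "$HOME_DIR"
printf '#!/bin/sh\n' > "$TOOLING_DIR/.local/bin/podman"

# Hypothetical default init-container logic: run once per PVC,
# preferring stow, falling back to cp when the image lacks stow.
MARKER="$HOME_DIR/.home_init_done"
if [ ! -f "$MARKER" ]; then
  if command -v stow >/dev/null 2>&1; then
    # Symlink tooling files into the persisted home directory.
    stow -d "$WORK_DIR" -t "$HOME_DIR" tooling
  else
    # Fallback: copy without overwriting files already on the PVC.
    cp -Rn "$TOOLING_DIR/." "$HOME_DIR/"
  fi
  touch "$MARKER"
fi
```

Because the marker file lives on the PVC, subsequent workspace starts would skip the whole block, matching the run-once behavior of the current entrypoint.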
Edit: one thing we'll have to consider: if we move the stow step outside of the entrypoint, the stow step will now be executed before the kubedock step, which might break kubedock (stow currently creates a symlink from `/home/tooling/.local/bin/podman` -> `/home/user/.local/bin/podman`). I'm hopeful this won't actually be a problem though, as `/home/tooling/.local/bin/` should be on `$PATH`.
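That assumption is easy to sanity-check. A tiny helper (hypothetical, purely for illustration) confirms whether a PATH-like string contains the tooling bin directory; the example value mimics what we'd expect inside the UDI:

```shell
# Check whether a PATH-like string ($2) contains a directory ($1),
# so the podman wrapper in /home/tooling/.local/bin/ stays reachable.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example PATH value mimicking the UDI's (an assumption for illustration):
example_path="/home/user/.local/bin:/home/tooling/.local/bin:/usr/bin"
if path_contains "/home/tooling/.local/bin" "$example_path"; then
  echo "podman wrapper reachable"
fi
```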
Sorry, the PR still had some rough work in it; I've since updated the PR with some of your feedback.
@dkwon17 since we're adding new fields to the DWOC, you'll have to modify `mergeConfig()` and `GetCurrentConfigString()` in sync.go to account for the new fields.
Additionally, the CRDs will have to be re-generated by running `make generate manifests fmt generate_default_deployment generate_olm_bundle_yaml` and adding the resulting changes into a separate commit.
Since we're still working through the details of this PR, these 2 steps can wait until we're set on the new DWOC fields.
@dkwon17 I've noticed that after enabling `persistUserHome`, all the content that was previously in `/home/user` vanishes.
@tolusha could you provide a reproducer? Does this happen when enabling `persistUserHome` and starting an existing workspace?
@tolusha: changing LGTM is restricted to collaborators
@dkwon17 your latest changes look good to me :) I still need to do some final manual testing on a cluster but I think this is just about ready to merge.
My 2 last thoughts on this PR:
* `/home/tooling/` (where the Podman wrapper exists) is on `$PATH`
Thanks for the feedback, I was able to verify the kubedock functionality [1] in the UDI for this PR
@dkwon17 awesome, thank you for verifying this. I'll do some final testing and have this merged.
Sounds good @AObuchow , do you want me to squash the commits and force push?
@dkwon17 yes please squash your commits then force push if you get a chance. I know you're going to be on PTO however, so I can do this for you.
@dkwon17 thanks for squashing your commits. Will do final review & have this merged by Wednesday morning :)
Note to myself: need to create a DWO issue before this is merged for tracking in the 0.28 release milestone.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AObuchow, dkwon17, ibuziuk, svor, tolusha
The full list of commands accepted by this bot can be found here.
The pull request process is described here
What does this PR do?
The problem
Today, the persistent home setup occurs in the entrypoint [1] of the tooling container (assuming it is based on the UDI). Notably, the entrypoint runs stow [2] to create symbolic links so the home directory is saved on the PVC.
If a poststart event depends on files or edits files located in the home directory, there is a race condition because the poststart event may start running before stow has finished running. Stow takes about 6 seconds to complete.
This race condition causes https://github.com/eclipse-che/che/issues/22914, because the checode poststart event runs before the entrypoint script completes.
Proposed solution
To solve the race condition altogether, the entrypoint script must finish before the poststart events run. To achieve this, an init container that runs the entrypoint script can be used.
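Concretely, the resulting pod spec would look roughly like the sketch below (field names and values here are illustrative assumptions, not the exact manifest DWO generates):

```yaml
spec:
  initContainers:
    - name: init-persistent-home   # name as seen in the demo video
      # Same image as the tooling container, per the proposal, so no
      # extra image pull is needed.
      image: quay.io/devfile/universal-developer-image:ubi8-latest
      command: ["/entrypoint.sh"]  # must exit once home setup completes
      volumeMounts:
        - name: persistent-home    # assumed volume name
          mountPath: /home/user
  containers:
    - name: tooling
      image: quay.io/devfile/universal-developer-image:ubi8-latest
```

Since Kubernetes runs init containers to completion before starting the main containers (and therefore before any postStart hooks), this ordering guarantees the home setup is done before poststart events fire.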
What is the container image for the init container?
To have a minimal effect on workspace startup times, the container image for the init container should be the same image as the tooling container's, to avoid pulling an extra image. Assuming the image is already pulled, I've observed no significant startup impact (see demo video in the next section) when running an extra container with the tooling image, even if that image is large (quay.io/devfile/universal-developer-image:ubi8-latest has a compressed size of 2.9GB).
From a Docker perspective, running a container only creates a writable layer on top of the already-pulled image layers, so if the image is already pulled, the image size does not significantly affect container startup.
I'm guessing this is also what happens in OpenShift/CRI-O, because adding the init container did not increase workspace startup time significantly.
What issues does this PR fix or reference?
https://github.com/eclipse-che/che/issues/22914 Fixes https://github.com/devfile/devworkspace-operator/issues/1260
Demo: https://youtu.be/RQeKKJgaogM
In the demo above, the init container named `init-persistent-home` runs for more than 9 seconds at the 0:20 mark. Stow is only run once (because of this if statement [3]) to set up the new PVC. So for subsequent workspace starts, the `init-persistent-home` init container takes less than a second to finish. Here is a second demo showing that: https://youtu.be/hKnb1mgBSqA?si=yuJ3EKmj_dX0qHj0&t=20
Is it tested? How?
To test this PR with Eclipse Che:
Install DWO on your cluster by adding this catalog source and installing DWO from OperatorHub:
After waiting a few minutes, the DWO from the new catalog should appear in OperatorHub. Install the one from `DevWorkspace Operator Catalog`.
Install Eclipse Che with chectl:
Enable persistent user home by adding the following in the CheCluster CR:
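The CR snippet itself didn't make it into this description; assuming the standard CheCluster field for this feature (worth verifying against your Che version's docs), it would be along the lines of:

```yaml
spec:
  devEnvironments:
    persistUserHome:
      enabled: true  # assumed field path, not quoted from this PR
```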
Create a workspace with the following repositories and confirm that the `podman info` command runs without errors after workspace startup (just like in the demo video):
PR Checklist
(`/test v8-devworkspace-operator-e2e, v8-che-happy-path` to trigger)
* `v8-devworkspace-operator-e2e`: DevWorkspace e2e test
* `v8-che-happy-path`: Happy path for verification of integration with Che