SSH agent initialization postStart event fails workspace if $HOME is not writable

AObuchow commented 3 weeks ago

Description

In container images where the $HOME directory is not writable (such as the go-toolset image, which has $HOME=/opt/app-root/src/) the init-ssh-agent-command-... postStart event will fail.

This is due to the fact we assume $HOME/ssh-environment is writable when doing ssh-agent | sed 's/^echo/#echo/' > $SSH_ENV_PATH, however this is not always the case:

$ ssh-agent |  sed 's/^echo/#echo/' > $SSH_ENV_PATH
bash: /opt/app-root/src/ssh-environment: Permission denied

Maybe we should wrap the entire ssh-agent intialization command with a (...) || true so that regardless of wether a specific step of the ssh-agent initialization fails, the workspace will start up. This is the approach taken for the init-persistent-home preStart event.

How To Reproduce

Set up an SSH key with a passphrase
Create a workspace using the go-toolset image:

kind: DevWorkspace
apiVersion: workspace.devfile.io/v1alpha2
metadata:
  name: plain-devworkspace
spec:
  started: true
  routingClass: 'basic'
  template:
    components:
      - name: web-terminal
        container:
          image: registry.access.redhat.com/ubi9/go-toolset:1.19.13-4.1697647145
          memoryRequest: 256Mi
          memoryLimit: 512Mi
          mountSources: true
          command:
           - "tail"
           - "-f"
           - "/dev/null"

Check the workspace status, it should have failed: oc get devworkspace -n $NAMESPACE

Expected behavior

The workspace should succceed to start up. Whether the automatic SSH passphrase provisioning functionality works is another topic (maybe we should set SSH_ENV_PATH=/tmp/ssh-environment instead of SSH_ENV_PATH=$HOME/ssh-environment?)

Additional context

Upstream Che Issue

l0rd commented 3 weeks ago

@AObuchow I have the same issue but that's not related to $HOME not being writable, in my case, the second container is distroless so it doesn't have sh or true. Any postStart event will fail in that container. But the init-ssh-agent-command is injected in both containers.

Some ideas:

There should be a documented attribute to opt-out from automatically injected post start events
If there are multiple components the event should be added for the first component only
If mountSources: false there should be no need to add the init-ssh-agent-command-... postStart event

That's my DevWorkspace with 2 containers:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: outyet-dw
spec:
  contributions:
    - components:
        - container:
            env:
              - name: CODE_HOST
                value: 0.0.0.0
          name: che-code-runtime-description
      name: che-code
      uri: https://eclipse-che.github.io/che-plugin-registry/main/v3/plugins/che-incubator/che-code/latest/devfile.yaml
  started: true
  template:
    components:
      - container:
          image: ghcr.io/l0rd/outyet-dev:latest
          memoryRequest: 2G
          memoryLimit: 10G
          cpuRequest: '1'
          cpuLimit: '4'
          mountSources: true
        name: dev
      - container:
          cpuLimit: 500m
          image: ghcr.io/l0rd/outyet@sha256:b83e158687d6cb3d7ae46382be1e4fbb8eb3572f3423a9c3c9beae6cd55cc0e8
          memoryLimit: 128Mi
          mountSources: false
        name: outyet
    projects:
      - git:
          remotes:
            origin: https://github.com/l0rd/outyet.git
        name: outyet
        sourceType: Git

l0rd commented 3 weeks ago

@vinokurig, your PR does not fix the use case above. I would appreciate it if you could address that as well.

vinokurig commented 3 weeks ago

@l0rd

There should be a documented attribute to opt-out from automatically injected post start events

I think this is not related to the current issue and can be done separately.

If there are multiple components the event should be added for the first component only

We can not be sure that the dev component will always be first in the list.

If mountSources: false there should be no need to add the init-ssh-agent-command-... postStart event

Why not to handle the ssh passphrase for such components?

AObuchow commented 3 weeks ago

@l0rd Thank you for the feedback on this issue, it's always appreciated to see how you're using DevWorkspace Operator :)

In general, I think your issue is outside the scope of the current issue (as it seems more related to postStart events in general, rather than the DWO-injected SSH agent initialization postStart event). I'll go ahead and create a new DWO issue for your topic.

I agree with @vinokurig's responses to your suggestions. An opt-out attribute for automatically injected postStart events would probably be best for container images where running postStart events is impossible.

I inspected the ghcr.io/l0rd/outyet:latest image (which is based on golang:onbuild) and saw that it does have sh but it's located in /usr/bin/sh instead of /bin/sh which is what DWO expects.

This suggests another improvement to DWO's postStart events: maybe we can use command (which is POSIX-compliant) to determine how to invoke sh if is available on $PATH? Edit: Nevermind... I don't think it's feasible to use command -v sh to test whether sh is available before executing the rest of the postStart event. However, maybe we can just invoke shand have it be found on $PATH instead of hard-coding the location of sh?

Unfortunately, if someone uses a scratch image that does not even have command, the postStart event will still cause the container to fail. In this case, the best solution would probably be to allow users to opt-out of the DWO-injected postStart events.

l0rd commented 3 weeks ago

We can not be sure that the dev component will always be first in the list.

@vinokurig The convention is to consider the first component the dev component. We contribute the editor in the first component, not in every component. There should be an attribute to specify a component that is not the first one as the dev component.

Why not to handle the ssh passphrase for such components?

@vinokurig isn't the SSH passphrase mainly used for git authentication? If that's not you can ignore my comment

I inspected the ghcr.io/l0rd/outyet:latest image

@AObuchow you have inspected the wrong image. In the meantime I have changed ghcr.io/l0rd/outyet:latest base image to make my scenario work (it used to work and that's what I am going to demo at KubeCon in a couple of weeks so I fixed it with a bigger but working image). But I had updated the example above with the distro-less image ghcr.io/l0rd/outyet@sha256:b83e158687d6cb3d7ae46382be1e4fbb8eb3572f3423a9c3c9beae6cd55cc0e8.

In general I think that #1307 was a bad idea because there is a high risk to break existing workspaces (we just found 2 container images where the command failed, there may be more). And no matter the value of the new feature implemented, avoiding breaking existing workspaces should be the priority. Also we should avoid, at the DWO level, to modify a DW applied by a user. The dashboard does an automatic generation of DW, so why not adding the post start event there?

ibuziuk commented 3 weeks ago

@l0rd got a question about https://github.com/devfile/devworkspace-operator/issues/1337#issuecomment-2450601646 is it reproducible for you even without SSH being configured?

l0rd commented 3 weeks ago

@ibuziuk I am reproducing it on developer sandbox, without any specific config on my side.

vinokurig commented 3 weeks ago

@l0rd @ibuziuk @AObuchow To summarize: from the discussion I see next issues:

$HOME directory is not writable, the related pull request addresses the particular issue.
sh might not be available. This might affect all workspaces regardless the presence of the ssh key configmap.
Do not add the init-ssh-agent post start event to all containers except the first one.

I believe that [2] and [3] issues should be addresses in a separate pull requests. I am going to create an issue for [2] but I am not sure if we need one for [3]?

l0rd commented 3 weeks ago

@vinokurig sure, I am perfectly fine with a separate PR, of course.

AObuchow commented 3 weeks ago

@l0rd @ibuziuk @AObuchow To summarize: from the discussion I see next issues:
1. `$HOME` directory is not writable, the related pull request addresses the particular issue.

2. `sh` might not be available. This might affect all workspaces regardless the presence of the ssh key configmap.

3. Do not add the `init-ssh-agent` post start event to all containers except the first one.
I believe that [2] and [3] issues should be addresses in a separate pull requests. I am going to create an issue for [2] but I am not sure if we need one for [3]?

@vinokurig Thank you for summarizing the current state of things.

I am going to create an issue for [2]

I'm happy to create an issue for this since it's past your working hours in your current timezone. I will also start investigating potential solutions to solve this issue.

Regarding [3]:

Do not add the init-ssh-agent post start event to all containers except the first one.

Though I agree with @l0rd's comment that the convention is to consider the first component as the main tooling component:

The convention is to consider the first component the dev component. We contribute the editor in the first component, not in every component. There should be an attribute to specify a component that is not the first one as the dev component.

We had an issue CRW-6614 where on the DevSandbox the first component in the devfile was not always the tooling component. I'm still confused as to why this was the case, as no reproducer devfile was left in that Jira and there's only screenshots that show this repo which does have the tooling container as the first component.

However: when an SSH key & passphrase is provided via the automount configmap/secret mechanism, it is mounted to all container components (not just the tooling component). Thus, it seems natural to also allow for automatic SSH passphrase provisioning with the ssh-agent in all container components.

So I do not think we should only provide the ssh-agent postStart event to the first container component.

l0rd commented 3 weeks ago

We had an issue CRW-6614 where on the DevSandbox the first component in the devfile was not always the tooling component. I'm still confused as to why this was the case, as no reproducer devfile was left in that Jira and there's only screenshots that show this repo which does have the tooling container as the first component.

Using the attribute controller.devfile.io/merge-contribution: true/false should allow you to control in which component the editor is contributed.

Thus, it seems natural to also allow for automatic SSH passphrase provisioning with the ssh-agent in all container components.

Ideally, that may be right. But considering that it can generate a workspace startup error, making using Che / Dev Spaces impossible, I would be very cautious. You should find a way to ignore any error in the passphrase provisioning or execute only when it's necessary and providing an option to opt out.

I understand that's not an easy task 🤷 and I don't have any magical solution. However, considering the maturity of Che / Dev Spaces, any new feature that introduces risks of breaking existing workspaces (with no workaround) should be avoided or reverted in my opinion.

vinokurig commented 3 weeks ago

According to the kubernetes documentation there is no way to ignore the post start event failure: If either a PostStart or PreStop hook fails, it kills the Container.

Although we can increase the stability of the workspace start with next solutions:

Do not add the post start event if the ssh secret is not defined
- The ssh key secret that maps the keys to the container filesystem, may have another from conventional name, so we can not detect them.
- If the ssh key has a passphrase set up, the workspace start will continue to fail due to a vulnerable container.
Add the post start event only to the first container component.
- I agree with @AObuchow
  
  However: when an SSH key & passphrase is provided via the automount configmap/secret mechanism, it is mounted to all container components (not just the tooling component). Thus, it seems natural to also allow for automatic SSH passphrase provisioning with the ssh-agent in all container components.
  
  So I do not think we should only provide the ssh-agent postStart event to the first container component.
Have an additional configuration to opt-out from all the init-ssh-agent post start event.
- This overcomplicates the flow and would be not very obvious for users.

AObuchow commented 3 weeks ago

Ideally, that may be right. But considering that it can generate a workspace startup error, making using Che / Dev Spaces impossible, I would be very cautious. You should find a way to ignore any error in the passphrase provisioning or execute only when it's necessary and providing an option to opt out.

I understand that's not an easy task 🤷 and I don't have any magical solution. However, considering the maturity of Che / Dev Spaces, any new feature that introduces risks of breaking existing workspaces (with no workaround) should be avoided or reverted in my opinion.

+1 Agreed. Allowing this feature to modify the DevWorkspace unconditionally (and potentially break existing workspaces) was something I should have avoided.

AObuchow commented 3 weeks ago

Do not add the post start event if the ssh secret is not defined

The ssh key secret that maps the keys to the container filesystem, may have another from conventional name, so we can not detect them.

If the ssh key has a passphrase set up, the workspace start will continue to fail due to a vulnerable container.

Agreed, unfortunately the workaround/fix I currently have provided only resolves the issue for Che/DevSpaces specifically.

Have an additional configuration to opt-out from all the init-ssh-agent post start event.

This overcomplicates the flow and would be not very obvious for users.

Instead of opt-out, I think an opt-in to enable the SSH agent feature (as a devfile & devworkspace attribute) might work. This is something I propose in the long term solution idea of this issue. The Che dashboard would be responsible for setting this attribute. For non-Che users, they would unfortunately have to consult the DWO documentation to discover this attribute, but at least their workspaces would not potentially fail from having this feature enabled by default.

AObuchow commented 3 weeks ago

Currently, @ibuziuk and I are thinking of guarding the injection of the SSH agent postStart event under the DWOC's config.enableExperimentalFeatures field until we can sort out how to handle this feature in the most graceful way.

ibuziuk commented 3 weeks ago

given that the feature is risky I believe it would be a really good move to enable it based on the DWOC config. on UI we can also articulate that this is experimental - https://github.com/eclipse-che/che-dashboard/pull/1249/files @l0rd @vinokurig @dmytro-ndp thoughts?

devfile / devworkspace-operator