canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Jupyter notebook is not creating with volume set to ReadWriteMany #840

Closed natalytvinova closed 8 months ago

natalytvinova commented 8 months ago

Bug Description

While creating a Jupyter Notebook, I created a volume with the access mode "ReadWriteMany". The logs are at the end of this report.

The volume is created in Kubernetes:

    $ kubectl get pvc -A
    NAMESPACE   NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
    admin       uat-workspace   Bound    pvc-ae4e34b4-da07-4576-81b0-fd2a9c2a248a   20Gi       RWX            csi-cinder-default   13m

    $ kubectl get pv -A
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS         REASON   AGE
    pvc-ae4e34b4-da07-4576-81b0-fd2a9c2a248a   20Gi       RWX            Delete           Bound    admin/uat-workspace   csi-cinder-default            13m

To Reproduce

  1. juju deploy bundle 1.8
  2. juju refresh istio-pilot --channel latest/edge/pr-381 --trust --config default-gateway=kubeflow
  3. juju refresh oidc-gatekeeper --channel latest/edge/pr-135 --trust
  4. juju config dex-auth public-url=https://my.domain.com/
  5. juju config oidc-gatekeeper public-url=https://my.domain.com/
  6. juju config istio-pilot domain-name=my.domain.com
  7. juju deploy self-signed-certificates
  8. juju relate self-signed-certificates istio-pilot
  9. apply workaround for this bug https://github.com/canonical/admission-webhook-operator/issues/126
  10. create a jupyter-notebook with a volume with ReadWriteMany

Environment

Kubeflow bundle 1.8
Juju 3.1.7
Charmed Kubernetes 1.28
Kubernetes runs on top of OpenStack Yoga

Relevant Log Output

[W 2024-02-23 08:31:28.458 ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
[I 2024-02-23 08:31:28.468 ServerApp] Package jupyterlab took 0.0000s to import
[I 2024-02-23 08:31:28.471 ServerApp] Package jupyter_server_fileid took 0.0026s to import
[I 2024-02-23 08:31:28.473 ServerApp] Package jupyter_server_mathjax took 0.0014s to import
[I 2024-02-23 08:31:28.479 ServerApp] Package jupyter_server_terminals took 0.0050s to import
[I 2024-02-23 08:31:28.507 ServerApp] Package jupyter_server_ydoc took 0.0269s to import
[I 2024-02-23 08:31:28.536 ServerApp] Package jupyterlab_git took 0.0291s to import
[I 2024-02-23 08:31:28.537 ServerApp] Package nbclassic took 0.0000s to import
[W 2024-02-23 08:31:28.539 ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
[I 2024-02-23 08:31:28.540 ServerApp] Package nbdime took 0.0000s to import
[I 2024-02-23 08:31:28.540 ServerApp] Package notebook_shim took 0.0000s to import
[W 2024-02-23 08:31:28.540 ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
[I 2024-02-23 08:31:28.547 ServerApp] jupyter_server_fileid | extension was successfully linked.
[I 2024-02-23 08:31:28.554 ServerApp] jupyter_server_mathjax | extension was successfully linked.
[I 2024-02-23 08:31:28.559 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-02-23 08:31:28.564 ServerApp] jupyter_server_ydoc | extension was successfully linked.
[I 2024-02-23 08:31:28.571 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-02-23 08:31:28.571 ServerApp] jupyterlab_git | extension was successfully linked.
[I 2024-02-23 08:31:28.577 ServerApp] nbclassic | extension was successfully linked.
[I 2024-02-23 08:31:28.577 ServerApp] nbdime | extension was successfully linked.
[W 2024-02-23 08:31:28.578 ServerApp] notebook_shim | error linking extension: [Errno 13] Permission denied: '/home/jovyan/.local'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 633, in get
        value = obj._trait_values[self.name]
                ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
    KeyError: 'browser_open_file'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 633, in get
        value = obj._trait_values[self.name]
                ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
    KeyError: 'runtime_dir'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/manager.py", line 342, in link_extension
        extension.link_all_points(self.serverapp)
      File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/manager.py", line 224, in link_all_points
        self.link_point(point_name, serverapp)
      File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/manager.py", line 214, in link_point
        point.link(serverapp)
      File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/manager.py", line 136, in link
        linker(serverapp)
      File "/opt/conda/lib/python3.11/site-packages/notebook_shim/nbserver.py", line 109, in _link_jupyter_server_extension
        members = diff_members(serverapp, nbapp)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/notebook_shim/nbserver.py", line 62, in diff_members
        m1 = public_members(obj1)
             ^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/notebook_shim/nbserver.py", line 56, in public_members
        members = inspect.getmembers(obj)
                  ^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/inspect.py", line 595, in getmembers
        return _getmembers(object, predicate, getattr)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/inspect.py", line 573, in _getmembers
        value = getter(object, key)
                ^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 688, in __get__
        return t.cast(G, self.get(obj, cls))  # the G should encode the Optional
                         ^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 636, in get
        default = obj.trait_defaults(self.name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 1900, in trait_defaults
        return t.cast(Sentinel, self._get_trait_default_generator(names[0])(self))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 1243, in __call__
        return self.func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/jupyter_server/serverapp.py", line 1606, in _default_browser_open_file
        return os.path.join(self.runtime_dir, basename)
                            ^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 688, in __get__
        return t.cast(G, self.get(obj, cls))  # the G should encode the Optional
                         ^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 636, in get
        default = obj.trait_defaults(self.name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 1900, in trait_defaults
        return t.cast(Sentinel, self._get_trait_default_generator(names[0])(self))
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/jupyter_core/application.py", line 110, in _runtime_dir_default
        ensure_dir_exists(rd, mode=0o700)
      File "/opt/conda/lib/python3.11/site-packages/jupyter_core/utils/__init__.py", line 26, in ensure_dir_exists
        os.makedirs(path, mode=mode)
      File "<frozen os>", line 215, in makedirs
      File "<frozen os>", line 215, in makedirs
      File "<frozen os>", line 215, in makedirs
      File "<frozen os>", line 225, in makedirs
    PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 633, in get
    value = obj._trait_values[self.name]
            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/jupyter-lab", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/application.py", line 607, in launch_instance
    serverapp = cls.initialize_server(argv=args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_server/extension/application.py", line 577, in initialize_server
    serverapp.initialize(
  File "/opt/conda/lib/python3.11/site-packages/traitlets/config/application.py", line 117, in inner
    return method(app, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_server/serverapp.py", line 2602, in initialize
    self.init_configurables()
  File "/opt/conda/lib/python3.11/site-packages/jupyter_server/serverapp.py", line 1912, in init_configurables
    "connection_dir": self.runtime_dir,
                      ^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 688, in __get__
    return t.cast(G, self.get(obj, cls))  # the G should encode the Optional
                     ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 636, in get
    default = obj.trait_defaults(self.name)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/traitlets/traitlets.py", line 1900, in trait_defaults
    return t.cast(Sentinel, self._get_trait_default_generator(names[0])(self))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/jupyter_core/application.py", line 110, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/opt/conda/lib/python3.11/site-packages/jupyter_core/utils/__init__.py", line 26, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'
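
A note for readers of the traceback above: the error names /home/jovyan/.local rather than the runtime directory itself, most likely because os.makedirs walks up to the first ancestor that does not exist and tries to create that one first. The sketch below illustrates this with a hypothetical helper, first_missing_ancestor (not part of Jupyter or the standard library). The .local/share/jupyter/runtime layout is an assumption based on Jupyter's usual Linux default, and a temporary directory stands in for the home directory.

```python
import os
import tempfile


def first_missing_ancestor(path):
    """Return the topmost directory os.makedirs(path) would need to create.

    Returns None when the path already exists. Hypothetical helper, for
    illustration only.
    """
    current = os.path.abspath(path)
    missing = None
    while not os.path.exists(current):
        missing = current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return missing


if __name__ == "__main__":
    # A temporary directory stands in for /home/jovyan (assumption).
    with tempfile.TemporaryDirectory() as home:
        runtime_dir = os.path.join(home, ".local", "share", "jupyter", "runtime")
        # The first directory to be created is <home>/.local
        print(first_missing_ancestor(runtime_dir))
```

Assuming the workspace volume is mounted at /home/jovyan, that first missing directory sits directly in the volume root, so creating it needs write access to the root of the volume. This is consistent with the three nested makedirs frames followed by the failing one in the log.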

Additional Context

No response

syncronize-issues-to-jira[bot] commented 8 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5373.

This message was autogenerated

natalytvinova commented 8 months ago

This also happens on ReadOnlyMany volumes for notebooks

kimwnasptd commented 8 months ago

@natalytvinova does it happen for ReadWriteOnce PVCs/volumes for notebooks?

The problem above is that

  1. the RWX PVC ends up being mounted and owned by root/root user
  2. the non-root user inside the notebook does not have permission to write in that mounted folder

The Notebook's pod sets .spec.securityContext.fsGroup = 100, which should ensure that the mounted PVCs are owned by the expected group. This works for RWO PVCs, which is why they are readable and writable, but it seems not to be the case for RWX PVCs.
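
To make the two points above concrete, here is a minimal sketch of the classic POSIX owner/group/other check for creating a file in a directory. It is a simplified model, not kernel or kubelet code: it ignores ACLs and capabilities, and the function name can_create_in, the UID 1000, and the modes 0o755 and 0o2775 are assumptions chosen for illustration. Only the group 100 comes from the value mentioned above.

```python
import stat


def can_create_in(uid, gids, dir_uid, dir_gid, dir_mode):
    """Simplified POSIX check: may this process create entries in the directory?

    Creating an entry needs both write and search (execute) permission on
    the directory. ACLs and Linux capabilities are deliberately ignored.
    """
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == dir_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (dir_mode & needed) == needed


# Assumed notebook user: UID 1000 with supplemental group 100.
# Volume root owned by root/root, as observed for the RWX PVC.
rwx_root_owned = can_create_in(1000, {100}, dir_uid=0, dir_gid=0, dir_mode=0o755)
# Volume root whose group was changed to 100 and made group-writable.
rwo_group_owned = can_create_in(1000, {100}, dir_uid=0, dir_gid=100, dir_mode=0o2775)
print(rwx_root_owned, rwo_group_owned)  # prints: False True
```

Under these assumptions the non-root user falls into the "other" class on the root/root volume and cannot create files, while group ownership by 100 plus group write permission is enough to make the volume usable.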

kimwnasptd commented 8 months ago

From the K8s docs I see that .spec.securityContext.fsGroup is:

A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:

  1. The owning GID will be the FSGroup
  2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
  3. The permission bits are OR'd with rw-rw----

If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context

So most probably this has to do with the StorageClass: the kubelet is unable to change the GID of the mounted RWX PVC. We'll need to better understand the storage provider, which in this case I understand is Cinder.
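
The three documented fsGroup effects quoted above can be written as a small function. This is only a sketch of the documented behaviour, not the kubelet's implementation; apply_fsgroup and the starting mode 0o755 are hypothetical.

```python
import stat


def apply_fsgroup(owner_gid, mode, fs_group):
    """Model the documented fsGroup effects on a volume root.

    Returns the (gid, mode) pair after mounting. With fs_group=None the
    ownership and permissions are left untouched, as the docs describe.
    """
    if fs_group is None:
        return owner_gid, mode
    new_mode = mode | stat.S_ISGID  # 2. the setgid bit is set
    new_mode |= 0o660               # 3. permission bits OR'd with rw-rw----
    return fs_group, new_mode       # 1. the owning GID becomes the FSGroup


# Assumed starting point: a volume root owned by GID 0 with mode 0o755.
gid, mode = apply_fsgroup(owner_gid=0, mode=0o755, fs_group=100)
print(gid, oct(mode))  # prints: 100 0o2775
```

If the kubelet cannot perform this step for a given volume, the volume presumably keeps whatever ownership it was created with, which behaves like the fs_group=None branch here.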

natalytvinova commented 8 months ago

Hi @kimwnasptd, no, ReadWriteOnce volumes are okay.

Yes, in this case it is Cinder. I checked the configs for the openstack-integrator, openstack-cloud-controller and cinder-csi charms, and nothing seems related to this.

kimwnasptd commented 8 months ago

Adding more context here after some exploration. Also thanks to @addyess for his help looking through the CSI code and issues.

First of all, Kubeflow Notebooks contain .spec.securityContext.fsGroup = 100 in their PodSpec. This field tells Kubernetes (the kubelet) which GID to apply to the volumes it mounts on the Pod:

https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context-1

A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:

  1. The owning GID will be the FSGroup
  2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
  3. The permission bits are OR'd with rw-rw----

If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.

The problem in this case is that the upstream Cinder CSI driver does not support fsGroup for RWX volumes (https://github.com/kubernetes/cloud-provider-openstack/issues/2075), which is why the RWX volume ends up being mounted as root/root.

Lastly, for reference, the charm of ours that creates the csi-cinder-default StorageClass is this one: https://github.com/canonical/cinder-csi-operator/blob/32c9361fcd3067c99ff4ba2a844d9dd12f2b7d36/src/storage_manifests.py#L106

kimwnasptd commented 8 months ago

So what happens in this case is:

  1. The CSI driver for volumes is handled by the cinder-csi-operator, which creates the csi-cinder-default StorageClass
  2. The upstream CSI driver does not support fsGroup for RWX volumes
  3. The Kubeflow notebook pod is setting fsGroup
  4. RWO PVCs are handled correctly, and their permissions change accordingly
  5. RWX PVCs of the csi-cinder-default StorageClass don't change their permissions
  6. The notebook doesn't have permissions to modify files in the root/root RWX volume that was mounted

Since this is more a problem of the underlying storage infrastructure not respecting the K8s constructs that Kubeflow relies on, I'll go ahead and close the issue, since there's not much we can do from the Kubeflow side.
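
For anyone who wants to confirm steps 4 and 5 on their own cluster, a small script like the following could be run from inside a pod that mounts the volume. It is a sketch that uses only the Python standard library; describe_mount is a hypothetical helper, and the path passed to it should be whatever mount point applies in your pod (the home directory is used here only as an example).

```python
import os
import stat


def describe_mount(path):
    """Report ownership and permission bits of a mount point.

    Hypothetical diagnostic helper: it only reads metadata and asks the OS
    whether the current process may create entries in the directory.
    """
    st = os.stat(path)
    return {
        "path": path,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "setgid": bool(st.st_mode & stat.S_ISGID),
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Example path only; replace with the volume's mount point.
    print(describe_mount(os.path.expanduser("~")))
    print("process uid:", os.getuid(), "groups:", sorted(os.getgroups()))
```

On a volume where fsGroup was applied, one would expect the reported gid to match the pod's fsGroup and setgid to be true; on the RWX volume described in this thread, one would expect uid and gid 0 and writable_by_me to be false for the non-root notebook user.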