crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG?] Cannot dynamically create read/write persistent volumes anymore? Read only? #4263

Closed hippyod closed 2 weeks ago

hippyod commented 3 weeks ago

General information

CRC version

WARN A new version (2.38.0) has been published on https://developers.redhat.com/content-gateway/file/pub/openshift-v4/clients/crc/2.38.0/crc-linux-amd64.tar.xz 
CRC version: 2.37.1+36d451
OpenShift version: 4.15.14

CRC status

CRC VM:          Running
OpenShift:       Running (v4.15.14)
RAM Usage:       12.13GB of 67.43GB
Disk Usage:      31.84GB of 136.8GB (Inside the CRC VM)
Cache Usage:     26.83GB
Cache Directory: /home/hippyod/.crc/cache

CRC config

- consent-telemetry                     : yes
- cpus                                  : 10
- disk-size                             : 128
- enable-cluster-monitoring             : true
- memory                                : 65536

Host Operating System

NAME="Fedora Linux"
VERSION="40 (Workstation Edition)"
ID=fedora
VERSION_ID=40
VERSION_CODENAME=""
PLATFORM_ID="platform:f40"
PRETTY_NAME="Fedora Linux 40 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:40"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f40/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=40
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=40
SUPPORT_END=2025-05-13
VARIANT="Workstation Edition"
VARIANT_ID=workstation

Steps to reproduce

  1. Create a Deployment with a dynamically provisioned persistent volume
  2. Try to write to it
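
For anyone following along, a minimal sketch of what step 1 can look like; the image and names here are placeholders, and storageClassName is omitted so that CRC's default crc-csi-hostpath-provisioner class is used:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-data            # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pv-test              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pv-test
  template:
    metadata:
      labels:
        app: pv-test
    spec:
      containers:
        - name: shell
          image: docker.io/library/bash   # placeholder test image
          command: ["sleep", "3600"]      # keep the pod alive for testing
          volumeMounts:
            - name: data
              mountPath: /var/lib/jenkins
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: test-data

Step 2 is then to oc rsh into the pod and touch a file under /var/lib/jenkins.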

Expected

It should be read/write by default (it has always been this way).

Actual

Read-only now. I tested with the latest version as well, and I also went back one release, just in case.

Logs

I get the following from the pod when Jenkins tries to download plugins using its plugin-manager utility. It fails now when I try using a persistent volume. This had worked for me for a couple of years, until the last couple of releases or so.

File containing list of plugins to be downloaded: /var/lib/jenkins/jenkins-plugins/jenkins-plugins.txt
Reading in plugins from /var/lib/jenkins/jenkins-plugins/jenkins-plugins.txt

Plugin download location: /var/lib/jenkins/plugins
Using update center https://updates.jenkins.io/update-center.json from JENKINS_UC environment variable
No CLI option or environment variable set for experimental update center, using default of https://updates.jenkins.io/experimental/update-center.json
No CLI option or environment variable set for incrementals mirror, using default of https://repo.jenkins-ci.org/incrementals
No CLI option or environment variable set for plugin info, using default of https://updates.jenkins.io/plugin-versions.json
Will use war file: /usr/share/java/jenkins.war
io.jenkins.tools.pluginmanager.impl.DirectoryCreationException: Unable to create plugin directory: '/var/lib/jenkins/plugins', supply a directory with -d <your-directory>

Before gathering the logs, try the following to see if it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

I had to go back to the normal container filesystem to get it working. I would like to have my persistent volumes back. Did something change in the VM? Or in OpenShift 4.15 that I wasn't aware of? Or in the way dynamic persistent volumes are configured in CRC now?
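
For anyone checking the provisioning side, the storage class behind the dynamic volumes and the resulting bindings can be inspected with standard commands (namespace and names below are placeholders):

$ oc get storageclass
$ oc get pvc -n <your-namespace>
$ oc describe pv <pv-name>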

I also started to notice the following error, and I don't know if it is related or whether I should open another bug. It is from a recent Jenkins build that failed in my Jenkins agent.

+ podman login --tls-verify=false --username elcicddev --password **** dev-demo-image-registry.apps-crc.testing
time="2024-07-07T09:42:49Z" level=warning msg="\"/\" is not a shared mount, this could cause issues or missing mounts with rootless containers"
time="2024-07-07T09:42:49Z" level=error msg="running `/usr/bin/newuidmap 10106 0 1001 1 1 100000 65536`: newuidmap: write to uid_map failed: Operation not permitted\n"

That is from my Jenkins logs while trying to build an image in a pod. I am using the same method I have used for two years: a custom SCC to enable rootless builds in the cluster. Again, did something change in how the image is created? Or did something change in OpenShift that breaks this?

hippyod commented 3 weeks ago

A couple of years ago someone from the buildah/podman folks posted a bug that might be related: #2968.

hippyod commented 3 weeks ago

So I downloaded a CRC version from back in February (OCP 4.14) that I knew was good, and tested it just to be sure. It doesn't work either, and I've been mounting these same volumes and using them for years with CRC. I'm totally at a loss as to how to proceed at this point.

anjannath commented 3 weeks ago

@hippyod Hi, did you try to create a test pod and mount a PV in there and still face the "unable to write" issue, or is it just with Jenkins?

Could you please share the steps to deploy Jenkins on CRC so that we can try to reproduce the issue? It could also be that something has changed in the PVC definition Jenkins is using.

hippyod commented 3 weeks ago

@anjannath @praveenkumar @cfergeau OK, after MANY HOURS of playing around today and yesterday, here's what I figured out (rename it to a *.yaml file; GitHub wouldn't let me upload one): test.txt

On CRC (of course), create a namespace called my-pvc-test and apply the YAML from the file. This will create two Deployments, two PVCs, one Service, and one ServiceAccount with a ClusterRoleBinding giving the ServiceAccount cluster-admin privileges.

Each PVC is mounted into the pod of the corresponding Deployment, read-write or read-only. In the read-only Deployment, the ServiceAccount bound to cluster-admin is assigned; in read-write, the default ServiceAccount. Both use the bash image from Docker Hub and sleep for a long time in the command entry point, just to keep the pods up for testing.

Once the pods are running, rsh into each. In read-only, touch /var/lib/jenkins/foo will fail; in read-write it will not. The difference is that in read-only, entering id gives the following output: uid=1001(1001) gid=0(root) groups=0(root), whereas in read-write it is uid=1000680000(1000680000) gid=0(root) groups=0(root),1000680000.
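
The attached test.txt is the authoritative reproducer; the sketch below only reconstructs the ServiceAccount and ClusterRoleBinding part from the description above, with hypothetical names:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pvc-test-admin       # hypothetical name
  namespace: my-pvc-test
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pvc-test-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: pvc-test-admin
    namespace: my-pvc-test

The differing uid in the id output suggests the two pods were admitted under different SecurityContextConstraints; the SCC that admitted a pod is recorded in its openshift.io/scc annotation:

$ oc get pod <pod-name> -o yaml | grep openshift.io/scc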

Please confirm you can replicate the problem.

praveenkumar commented 3 weeks ago

I tried the YAML file you shared, on F40 with SELinux enabled, and I am not able to reproduce the issue you are facing :(

10:48 $ oc get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
my-data-1   Bound    pvc-c0f367ac-c494-4d74-8042-542fe716ae8b   49Gi       RWX            crc-csi-hostpath-provisioner   14s
my-data-2   Bound    pvc-28e8988e-611c-4ca1-8bb0-18b39aa57e01   49Gi       RWX            crc-csi-hostpath-provisioner   14s

10:48 $ oc get all
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                              READY   STATUS    RESTARTS   AGE
pod/read-only-596cdf84c5-vldwj    1/1     Running   0          23s
pod/read-write-84f4b58fb9-lhlgs   1/1     Running   0          23s

NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/read-only    ClusterIP   10.217.4.100   <none>        8080/TCP   23s
service/read-write   ClusterIP   10.217.5.111   <none>        8080/TCP   23s

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/read-only    1/1     1            1           23s
deployment.apps/read-write   1/1     1            1           23s

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/read-only-596cdf84c5    1         1         1       23s
replicaset.apps/read-write-84f4b58fb9   1         1         1       23s

10:48 $ oc rsh read-only-596cdf84c5-vldwj
~ $ id
uid=1000670000(1000670000) gid=0(root) groups=0(root),1000670000
~ $ ls /var/lib/jenkins/
~ $ touch /var/lib/jenkins/test
~ $ vi /var/lib/jenkins/test
$ cat /var/lib/jenkins/test 
afajd
afkadjfk
adfjakf
~ $ id
uid=1000670000(1000670000) gid=0(root) groups=0(root),1000670000
~ $ exit

10:50 $ oc rsh read-write-84f4b58fb9-lhlgs 
~ $ touch /var/lib/jenkins/test
~ $ vi /var/lib/jenkins/test 
~ $ cat /var/lib/jenkins/test 
adfja
akdfja
adfkja
~ $ id
uid=1000670000(1000670000) gid=0(root) groups=0(root),1000670000
~ $ exit

10:51 $ ./crc status
CRC VM:          Running
OpenShift:       Running (v4.15.17)
RAM Usage:       7.524GB of 10.92GB
Disk Usage:      37.11GB of 53.08GB (Inside the CRC VM)
Cache Usage:     174.5GB
Cache Directory: /home/prkumar/.crc/cache

10:51 $ getenforce 
Enforcing

Are you using the CSB-provided F40?

hippyod commented 2 weeks ago

@praveenkumar @anjannath @cfergeau Deep apologies for taking a while to get back to all y'all. I ran a huge number of tests, up to and including rebuilding my machine from scratch to make sure it wasn't an OS problem. I wanted to be thorough.

The long and short of it is that what I thought were some minor changes to my custom SecurityContextConstraints to support nonroot podman builds weren't trivial at all. Somehow they corrupted everything, and I have no idea why (I do NOT understand them as well as I'd hoped). The RH documentation on the subject on

Everything is working as it did before. I do not understand (I want to emphasize this quite a bit) why the example I sent you failed on my machine even though it had no relation to the SCC. I don't understand why my custom SCC changes made such a mess of things, but I'm only an application developer.

For posterity's sake, the fixes I stumbled on were:

- allowPrivilegeEscalation: true (for podman to work correctly in rootless mode)
- fsGroup: type: MustRunAs (for cluster-admins to properly set the volumes)

Alternative workarounds I avoided as more privileged:

- allowHostDirVolumePlugin: true
- fsGroup: type: RunAsAny (in this case, I had to set fsGroup: 0 on the Deployment)
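
A rough sketch of how those fields sit in a custom SCC; apart from the two fixes and the avoided alternative named above, every name and setting here is illustrative, not the reporter's actual SCC:

kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
  name: rootless-builder           # hypothetical name
allowPrivilegeEscalation: true     # the fix: rootless podman's newuidmap/newgidmap need it
allowHostDirVolumePlugin: false    # the more-privileged alternative, left disabled
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs                  # the fix: OpenShift applies the project's fsGroup to volumes
supplementalGroups:
  type: RunAsAny
volumes:
  - configMap
  - downwardAPI
  - emptyDir
  - persistentVolumeClaim
  - projected
  - secret
users:
  - system:serviceaccount:my-pvc-test:jenkins   # hypothetical service account

With fsGroup: type: RunAsAny instead, nothing sets the group on the volume automatically, which is why the Deployment then needed an explicit securityContext with fsGroup: 0, as noted above.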

Sorry for the confusion. I wish I understood OpenShift security better, but hopefully if someone makes a stupid mistake like this again in the future, they'll find this info. Thanks again for the help.