ChrisHirsch / kuberhealthy-storage-check

A storage check for the kuberhealthy project
Apache License 2.0

storage-check-pvc-init-job constantly failing #14

Open ElectricRabbit opened 4 years ago

ElectricRabbit commented 4 years ago

Hello, I'd like to ask about the storage check job. All I'm getting from storage-check-pvc-init-job is /bin/sh: 1: cannot create /data/index.html: Permission denied. I just tried enabling allowPrivilegeEscalation and it didn't help. Normally we have no problem writing to PVs. I'm considering securityContext.readOnlyRootFilesystem, but that switch is quite dangerous for production since it applies globally. Is it possible that this test is not compatible with the old StorageClass we are still using?

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
reclaimPolicy: Delete
volumeBindingMode: Immediate

What am I missing?

EDIT: nope, not working even with securityContext.readOnlyRootFilesystem=false

github-actions[bot] commented 4 years ago

👋 Welcome to Kuberhealthy Storage Check. Thanks for opening your first issue.

ChrisHirsch commented 3 years ago

Can you give me more details about the environment you're running in? This was running successfully in a VMware env, but everyone does things slightly differently. Thanks!

ChrisHirsch commented 3 years ago

Also, did you give the service the proper role to create the storage? That is critical and may be what you're running into. If you look in the deploy directory, you'll need to make sure the service account (storage-sa in this case) has permissions, i.e. the proper Role and RoleBinding, to create the storage. Let me know if that helps any.
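
For reference, a minimal sketch of what such a Role and RoleBinding could look like. The names, namespace, and exact resource list below are assumptions for illustration, not the manifests shipped in the deploy directory, so treat that directory as authoritative:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: storage-check-role        # hypothetical name
  namespace: kuberhealthy         # adjust to wherever the check runs
rules:
- apiGroups: [""]
  resources: ["persistentvolumeclaims", "pods", "pods/log"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: storage-check-rolebinding
  namespace: kuberhealthy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: storage-check-role
subjects:
- kind: ServiceAccount
  name: storage-sa
  namespace: kuberhealthy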

ElectricRabbit commented 3 years ago

@ChrisHirsch what kind of info do you need? There are a lot of things I could tell you about our environment 👍

There is no problem creating the PV (storage); the problem is writing to it once it has been created and attached to the pod. We have no restrictions on deployments/pods creating storage and working with it. It would be strange for a pod to be able to create the storage object (PV) but then not have the rights to write to it. We have no restrictions on the storage class: anything that wants to can use the storage class, create a PV, and mount it.

ChrisHirsch commented 3 years ago

Can you by chance drop in the logs from the pod for the storage check? Hopefully that will shed some light. Obviously this hasn't seen many environments...yet...but I do feel that it should be storage agnostic, since it simply provisions storage from the SC, creates a file on the PVC, and then shares it around to the various nodes. Of course I'm sure I'll be proven wrong; I've probably made some assumptions that are not necessarily generic, and that's probably what you're running into.

Thanks for your patience!

ElectricRabbit commented 3 years ago

> Can you by chance drop in the logs from the pod for the storage check? Hopefully that will shed some light. Obviously this hasn't seen many environments...yet...but I do feel that it should be storage agnostic, since it simply provisions storage from the SC, creates a file on the PVC, and then shares it around to the various nodes. Of course I'm sure I'll be proven wrong; I've probably made some assumptions that are not necessarily generic, and that's probably what you're running into.
>
> Thanks for your patience!

@ChrisHirsch sorry, I missed your comment. The only log I got from that pod was the one above: /bin/sh: 1: cannot create /data/index.html: Permission denied. Anyway, I think we can close this, because we have big issues with our current storage class in general. As we are currently working on a CSI plugin implementation and want to migrate everything, this issue seems to be irrelevant. I'll get back to your tool after the CSI migration.

blame19 commented 3 years ago

We're hitting this as well on AKS. The same /bin/sh: 1: cannot create /data/index.html: Permission denied error pops out on some of the storage classes we're testing (managed-premium and default).

storage-check-azurefile-1620284223                 0/1     Completed           0          77m
storage-check-azurefile-1620287823                 0/1     Completed           0          17m
storage-check-default-1620280623                   0/1     Completed           0          137m
storage-check-default-1620284224                   0/1     Completed           0          77m
storage-check-default-1620287823                   1/1     Running             0          17m
storage-check-managed-premium-1620280623           0/1     Completed           0          137m
storage-check-managed-premium-1620284224           0/1     Completed           0          77m
storage-check-managed-premium-1620287824           1/1     Running             0          17m
storage-check-pvc-default-init-job-4xfpn           0/1     Error               0          9m32s
storage-check-pvc-default-init-job-89d22           0/1     Error               0          6m33s
storage-check-pvc-default-init-job-bmhdx           0/1     Error               0          14m
storage-check-pvc-default-init-job-bzsc7           0/1     Error               0          16m
storage-check-pvc-default-init-job-kph2t           0/1     ContainerCreating   0          62s
storage-check-pvc-default-init-job-qpwcf           0/1     Error               0          14m
storage-check-pvc-managed-premium-init-job-7fwjf   0/1     ContainerCreating   0          18s
storage-check-pvc-managed-premium-init-job-8jbmh   0/1     Error               0          10m
storage-check-pvc-managed-premium-init-job-9pksz   0/1     Error               0          5m49s
storage-check-pvc-managed-premium-init-job-mvrxq   0/1     Error               0          15m
storage-check-pvc-managed-premium-init-job-ws72h   0/1     Error               0          15m
storage-check-pvc-managed-premium-init-job-z24fq   0/1     Error               0          16m

k logs storage-check-default-1620284224 -n synthetic

time="2021-05-06T06:57:30Z" level=info msg="Created storage in synthetic namespace: storage-check-pvc-default"
time="2021-05-06T06:57:30Z" level=info msg="Creating a job storage-check-pvc-default-init-job in synthetic namespace environment variables: map[]"
time="2021-05-06T06:57:30Z" level=info msg="Job  storage-check-pvc-default-init-job  is &Job{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:storage-check-pvc-default-init-job,GenerateName:,Namespace:synthetic,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{source: kuberhealthy,storage-timestamp: unix-1620284231,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:JobSpec{Parallelism:nil,Completions:nil,ActiveDeadlineSeconds:nil,Selector:nil,ManualSelector:nil,Template:k8s_io_api_core_v1.PodTemplateSpec{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:storage-check-pvc-default-init-job,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:PodSpec{Volumes:[{data {nil nil nil nil nil nil nil nil nil PersistentVolumeClaimVolumeSource{ClaimName:storage-check-pvc-default,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{storage-check-pvc-default-init-job bitnami/nginx:1.19 [/bin/sh] [-c echo storage-check-ok > /data/index.html & ls -la /data && cat /data/index.html]  [] [] [] {map[] map[]} [{data false /data  <nil> }] [] nil nil nil   IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:nil,ActiveDeadlineSeconds:nil,DNSPolicy:,NodeSelector:map[string]string{},ServiceAccountName:,DeprecatedServiceAccount:,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:nil,ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[],HostAliases:[],PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],RuntimeClassName:nil,EnableServiceLinks:nil,PreemptionPolicy:nil,},},BackoffLimit:nil,TTLSecondsAfterFinished:nil,},Status:JobStatus{Conditions:[],StartTime:<nil>,CompletionTime:<nil>,Active:0,Succeeded:0,Failed:0,},} namespace environment variables: map[]"
time="2021-05-06T06:57:30Z" level=info msg="Created Storage Initialiazer resource."
time="2021-05-06T06:57:30Z" level=info msg="It looks like: &Job{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:storage-check-pvc-default-init-job,GenerateName:,Namespace:synthetic,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{source: kuberhealthy,storage-timestamp: unix-1620284231,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:JobSpec{Parallelism:nil,Completions:nil,ActiveDeadlineSeconds:nil,Selector:nil,ManualSelector:nil,Template:k8s_io_api_core_v1.PodTemplateSpec{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:storage-check-pvc-default-init-job,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:PodSpec{Volumes:[{data {nil nil nil nil nil nil nil nil nil PersistentVolumeClaimVolumeSource{ClaimName:storage-check-pvc-default,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{storage-check-pvc-default-init-job bitnami/nginx:1.19 [/bin/sh] [-c echo storage-check-ok > /data/index.html & ls -la /data && cat /data/index.html]  [] [] [] {map[] map[]} [{data false /data  <nil> }] [] nil nil nil   IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:nil,ActiveDeadlineSeconds:nil,DNSPolicy:,NodeSelector:map[string]string{},ServiceAccountName:,DeprecatedServiceAccount:,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:nil,ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[],HostAliases:[],PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],RuntimeClassName:nil,EnableServiceLinks:nil,PreemptionPolicy:nil,},},BackoffLimit:nil,TTLSecondsAfterFinished:nil,},Status:JobStatus{Conditions:[],StartTime:<nil>,CompletionTime:<nil>,Active:0,Succeeded:0,Failed:0,},}"
time="2021-05-06T06:57:30Z" level=info msg="Initializing storage in cluster with name: storage-check-pvc-default-init-job"
time="2021-05-06T06:57:30Z" level=info msg="Watching for storage initializer Job to exist."
time="2021-05-06T06:57:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event ADDED"
time="2021-05-06T06:57:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:01:44Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:01:54Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:04:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:09:22Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:16:12Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:16:59Z" level=info msg="Cancelling init storage job and shutting down due to interrupt. err:context deadline exceeded"
time="2021-05-06T07:16:59Z" level=error msg="Reporting errors to Kuberhealthy: [failed to initialize storage storage within timeout]"

k logs storage-check-pvc-default-init-job-wfrnz -n synthetic

/bin/sh: 1: cannot create /data/index.html: Permission denied

blame19 commented 3 years ago

Can we reopen?

ChrisHirsch commented 3 years ago

Sure...so can I get some additional information on these environments? If it's easier we can also chat in #kuberhealthy.

blame19 commented 3 years ago

Sure; what kind of information could help? I was looking into this and noticed that the init job on my cluster doesn't fail for the storage class "azurefile", so maybe there's something related to how azure-disk manages permissions; see the sketch after the describe output below.

> k describe sc managed-premium
> Name:            managed-premium
> IsDefaultClass:  No
> Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{},"labels":{"kubernetes.io/cluster-service":"true"},"name":"managed-premium"},"parameters":{"cachingmode":"ReadOnly","kind":"Managed","storageaccounttype":"Premium_LRS"},"provisioner":"kubernetes.io/azure-disk"}
> 
> Provisioner:           kubernetes.io/azure-disk
> Parameters:            cachingmode=ReadOnly,kind=Managed,storageaccounttype=Premium_LRS
> AllowVolumeExpansion:  True
> MountOptions:          <none>
> ReclaimPolicy:         Delete
> VolumeBindingMode:     Immediate
> Events:                <none>
>
>  k describe sc default
> Name:            default
> IsDefaultClass:  Yes
> Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"},"labels":{"kubernetes.io/cluster-service":"true"},"name":"default"},"parameters":{"cachingmode":"ReadOnly","kind":"Managed","storageaccounttype":"StandardSSD_LRS"},"provisioner":"kubernetes.io/azure-disk"}
> ,storageclass.beta.kubernetes.io/is-default-class=true
> Provisioner:           kubernetes.io/azure-disk
> Parameters:            cachingmode=ReadOnly,kind=Managed,storageaccounttype=StandardSSD_LRS
> AllowVolumeExpansion:  True
> MountOptions:          <none>
> ReclaimPolicy:         Delete
> VolumeBindingMode:     Immediate
> Events:                <none>
>
> k describe sc azurefile
> Name:            azurefile
> IsDefaultClass:  No
> Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{},"labels":{"kubernetes.io/cluster-service":"true"},"name":"azurefile"},"parameters":{"skuName":"Standard_LRS"},"provisioner":"kubernetes.io/azure-file"}
> 
> Provisioner:           kubernetes.io/azure-file
> Parameters:            skuName=Standard_LRS
> AllowVolumeExpansion:  True
> MountOptions:          <none>
> ReclaimPolicy:         Delete
> VolumeBindingMode:     Immediate
> Events:                <none>
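
For what it's worth, the difference would be consistent with how the two provisioners handle ownership: azure-file volumes are CIFS mounts whose permissions are set by mount options at mount time, while azure-disk volumes are ext4 filesystems owned by root. As an illustration (a hypothetical class, not one of the classes above), an azure-file StorageClass can even pin permissive ownership explicitly via mountOptions:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-writable        # illustrative name only
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
mountOptions:
- uid=1001        # hand the mount to the container's non-root user
- gid=1001
- dir_mode=0777
- file_mode=0777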

blame19 commented 3 years ago

So, I dug around and found that the PVC gets mounted under /data/ with the following permissions:

storage-check-pvc-default-custom-job-tspvb:/# ls -la
total 88
drwxr-xr-x   1 root root 4096 May 10 12:59 .
drwxr-xr-x   1 root root 4096 May 10 12:59 ..
drwxr-xr-x   2 root root 4096 Apr 29 17:26 app
drwxr-xr-x   1 root root 4096 May  5 02:46 bin
drwxr-xr-x   3 root root 4096 May  5 02:47 bitnami
drwxr-xr-x   2 root root 4096 Feb 18 11:59 boot
drwxrwxr-x   2 root root 4096 May  5 02:46 certs
drwxrwsr-x   3 root root 4096 May 10 13:03 data
drwxr-xr-x   5 root root  360 May 10 12:59 dev
drwxr-xr-x   1 root root 4096 May 10 12:59 etc
drwxr-xr-x   2 root root 4096 Feb 18 11:59 home
drwxr-xr-x   1 root root 4096 Sep 25  2017 lib
drwxr-xr-x   2 root root 4096 Feb 18 11:59 lib64
drwxr-xr-x   2 root root 4096 Feb 18 11:59 media
drwxr-xr-x   2 root root 4096 Feb 18 11:59 mnt
drwxrwxr-x   1 root root 4096 May  5 02:46 opt
dr-xr-xr-x 508 root root    0 May 10 12:59 proc
drwx------   2 root root 4096 Feb 18 11:59 root
drwxr-xr-x   1 root root 4096 May 10 12:59 run
drwxr-xr-x   1 root root 4096 May  5 02:46 sbin
drwxr-xr-x   2 root root 4096 Feb 18 11:59 srv
dr-xr-xr-x  12 root root    0 May 10 12:59 sys
drwxrwxrwt   1 root root 4096 May  5 02:46 tmp
drwxrwxr-x   1 root root 4096 May  5 02:46 usr
drwxr-xr-x   1 root root 4096 Feb 18 11:59 var

By default the PVC init jobs don't specify a securityContext. As a result, at least in my environment, they don't have enough permissions to modify files under /data/. I tested this by deploying a custom job which runs as root:

apiVersion: batch/v1
kind: Job
metadata:
  name: storage-check-pvc-default-custom-job
  namespace: synthetic
spec:
  template:
    metadata:
      creationTimestamp: null
      generateName: storage-check-pvc-default-custom-job
      labels:
        job-name: storage-check-pvc-default-custom-job
    spec:
      containers:
      - args:
        - -c
        - sleep 183600
        command:
        - /bin/sh
        image: bitnami/nginx:1.19
        imagePullPolicy: IfNotPresent
        name: storage-check-pvc-default-custom-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: data
      securityContext:
        runAsUser: 0
        fsGroup: 0
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: storage-check-pvc-default

and this ran correctly. By entering the container I was able (as expected) to open and create files under /data/. Of course, making the job run as root is a hacky way of making it work. The permissions on the data folder should be 777; instead they are drwxrwsr-x.
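
A less invasive workaround than running the whole job as root would be a root initContainer that only fixes the volume permissions. This is a sketch under my assumptions (the helper name, image, and 0777 mode are illustrative, and it is not part of the check as shipped); it slots into the same pod spec as the job above:

      initContainers:
      - name: fix-volume-perms                      # hypothetical helper container
        image: busybox:1.36
        command: ["sh", "-c", "chmod 0777 /data"]   # open up the mount for the non-root main container
        securityContext:
          runAsUser: 0                              # only this container runs as root
        volumeMounts:
        - mountPath: /data
          name: data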

blame19 commented 3 years ago

So, I tested out another custom job which uses the same securityContext as the default specified in your yaml files:

apiVersion: batch/v1
kind: Job
metadata:
  name: storage-check-pvc-default-custom-job-2
  namespace: synthetic
  resourceVersion: "717789529"
  selfLink: /apis/batch/v1/namespaces/synthetic/jobs/storage-check-pvc-default-custom-job-2
  uid: 5511697b-a92f-4b73-ab2d-fa6f7c9b1542
spec:
  ...
    spec:
      containers:
      - args:
        - -c
        - echo storage-check-ok > /data/index.html && ls -la /data && cat /data/index.html
        command:
        - /bin/sh
        image: bitnami/nginx:1.19
        imagePullPolicy: IfNotPresent
        name: storage-check-pvc-default-custom-job-2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: data
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 999
        runAsUser: 999
      ...
status:
  completionTime: "2021-05-10T16:03:37Z"
  conditions:
  - lastProbeTime: "2021-05-10T16:03:37Z"
    lastTransitionTime: "2021-05-10T16:03:37Z"
    status: "True"
    type: Complete
  startTime: "2021-05-10T15:25:09Z"
  succeeded: 1

this actually succeeded:

k get jobs -n synthetic
NAME                                         COMPLETIONS   DURATION   AGE
storage-check-pvc-default-custom-job-2       1/1           38m        68m

So a solution to this issue would be making sure that the jobs are created with a correct securityContext. However (and correct me if I'm wrong), those jobs are created at https://github.com/Comcast/kuberhealthy/, not in your repo, so maybe we have to fix something on Comcast's side.

ojasaar commented 11 months ago

In case someone is still battling with this: the problem is that the PVC init job runs as whatever user the Docker image defines (since the job does not set a securityContext). For bitnami/nginx:1.19 that is UID 1001. Most PVs have root-owned filesystem permissions, which is why you get Permission denied. One solution is to use an image whose default user is root, such as nginx:1.25-perl.
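
Either way, the fix amounts to aligning the pod's effective UID/GID with the volume's ownership. A minimal sketch of the securityContext variant, assuming the bitnami/nginx:1.19 default UID of 1001 (illustrative, not the job spec the check actually generates):

    spec:
      securityContext:
        runAsUser: 1001       # match the image's default non-root user
        fsGroup: 1001         # kubelet chgrps supported volume types to this GID at mount time
      containers:
      - name: storage-check-pvc-init
        image: bitnami/nginx:1.19
        command: ["/bin/sh", "-c", "echo storage-check-ok > /data/index.html"]
        volumeMounts:
        - mountPath: /data
          name: data
      restartPolicy: Never
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: storage-check-pvc-default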