Open ElectricRabbit opened 4 years ago
:wave: Welcome to Kuberhealthy Storage Check. Thanks for opening your first issue.
Can you give me more details about the environment you're running in? This was running successfully in a VMware env, but everyone does things slightly differently. Thanks!
Also, did you give the service account the proper role to create the storage? That is critical and may be what you're running into. If you look in the deploy directory, you'll need to make sure the service account (storage-sa in this case) has permissions (the proper Role and RoleBinding) to create the storage. Let me know if that helps any.
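Very roughly, the Role/RoleBinding pairing for storage-sa looks like the sketch below; the resource list, verbs, names, and namespace here are illustrative assumptions, not the shipped manifests (those live in the deploy directory):

# Illustrative sketch only -- see the deploy directory for the real RBAC manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: storage-check-role          # illustrative name
  namespace: kuberhealthy           # example namespace
rules:
  - apiGroups: [""]
    resources: ["persistentvolumeclaims", "pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: storage-check-rolebinding   # illustrative name
  namespace: kuberhealthy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: storage-check-role
subjects:
  - kind: ServiceAccount
    name: storage-sa                # the service account referenced above
    namespace: kuberhealthy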
@ChrisHirsch what kind of info do you need? There are a lot of things I could tell you about our environment 👍
Creating the PV (storage) is not the problem; writing to it once it is created and attached to the pod is. We have no restrictions preventing deployments/pods from creating storage and working with it. It would be strange for a pod to be able to create the storage object (PV) but then no longer have the rights to write to it. We have no restrictions on the storage class either; anything that wants to use the storage class can create a PV and mount it.
Can you by chance drop in the logs from the pod for the storage-check? Hopefully that will shed some light. Obviously this hasn't seen many environments...yet...but I do feel that this should be storage agnostic, as it simply provisions storage from the SC, creates a file on the PVC, and then shares it around to the various nodes. Of course I'm sure I'll be proven wrong and have probably made some assumptions that are not necessarily generic, which is probably what you're running into.
Thanks for your patience!
@ChrisHirsch sorry, I missed your comment. The only log I got from that pod was the one above: /bin/sh: 1: cannot create /data/index.html: Permission denied. Anyway, I think we can close this, because we have big issues with our current storage class in general. As we are currently working on a CSI plugin implementation and want to migrate everything, this issue seems to be irrelevant. I'll get back to using your Kuberhealthy tool after this CSI migration.
We're hitting this as well on AKS. The same /bin/sh: 1: cannot create /data/index.html: Permission denied pops up for some of the storage classes we're testing (managed-premium and default).
storage-check-azurefile-1620284223 0/1 Completed 0 77m
storage-check-azurefile-1620287823 0/1 Completed 0 17m
storage-check-default-1620280623 0/1 Completed 0 137m
storage-check-default-1620284224 0/1 Completed 0 77m
storage-check-default-1620287823 1/1 Running 0 17m
storage-check-managed-premium-1620280623 0/1 Completed 0 137m
storage-check-managed-premium-1620284224 0/1 Completed 0 77m
storage-check-managed-premium-1620287824 1/1 Running 0 17m
storage-check-pvc-default-init-job-4xfpn 0/1 Error 0 9m32s
storage-check-pvc-default-init-job-89d22 0/1 Error 0 6m33s
storage-check-pvc-default-init-job-bmhdx 0/1 Error 0 14m
storage-check-pvc-default-init-job-bzsc7 0/1 Error 0 16m
storage-check-pvc-default-init-job-kph2t 0/1 ContainerCreating 0 62s
storage-check-pvc-default-init-job-qpwcf 0/1 Error 0 14m
storage-check-pvc-managed-premium-init-job-7fwjf 0/1 ContainerCreating 0 18s
storage-check-pvc-managed-premium-init-job-8jbmh 0/1 Error 0 10m
storage-check-pvc-managed-premium-init-job-9pksz 0/1 Error 0 5m49s
storage-check-pvc-managed-premium-init-job-mvrxq 0/1 Error 0 15m
storage-check-pvc-managed-premium-init-job-ws72h 0/1 Error 0 15m
storage-check-pvc-managed-premium-init-job-z24fq 0/1 Error 0 16m
k logs storage-check-default-1620284224 -n synthetic
time="2021-05-06T06:57:30Z" level=info msg="Created storage in synthetic namespace: storage-check-pvc-default"
time="2021-05-06T06:57:30Z" level=info msg="Creating a job storage-check-pvc-default-init-job in synthetic namespace environment variables: map[]"
time="2021-05-06T06:57:30Z" level=info msg="Job storage-check-pvc-default-init-job is &Job{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:storage-check-pvc-default-init-job,GenerateName:,Namespace:synthetic,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{source: kuberhealthy,storage-timestamp: unix-1620284231,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:JobSpec{Parallelism:nil,Completions:nil,ActiveDeadlineSeconds:nil,Selector:nil,ManualSelector:nil,Template:k8s_io_api_core_v1.PodTemplateSpec{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:storage-check-pvc-default-init-job,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:PodSpec{Volumes:[{data {nil nil nil nil nil nil nil nil nil PersistentVolumeClaimVolumeSource{ClaimName:storage-check-pvc-default,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{storage-check-pvc-default-init-job bitnami/nginx:1.19 [/bin/sh] [-c echo storage-check-ok > /data/index.html & ls -la /data && cat /data/index.html] [] [] [] {map[] map[]} [{data false /data <nil> }] [] nil nil nil IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:nil,ActiveDeadlineSeconds:nil,DNSPolicy:,NodeSelector:map[string]string{},ServiceAccountName:,DeprecatedServiceAccount:,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:nil,ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[],HostAliases:[],PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],RuntimeClassName:nil,EnableServiceLinks:nil,PreemptionPolicy:nil,},},BackoffLimit:nil,TTLSecondsAfterFinished:nil,},Status:JobStatus{Conditions:[],StartTime:<nil>,CompletionTime:<nil>,Active:0,Succeeded:0,Failed:0,},} namespace environment variables: map[]"
time="2021-05-06T06:57:30Z" level=info msg="Created Storage Initialiazer resource."
time="2021-05-06T06:57:30Z" level=info msg="It looks like: &Job{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:storage-check-pvc-default-init-job,GenerateName:,Namespace:synthetic,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{source: kuberhealthy,storage-timestamp: unix-1620284231,},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:JobSpec{Parallelism:nil,Completions:nil,ActiveDeadlineSeconds:nil,Selector:nil,ManualSelector:nil,Template:k8s_io_api_core_v1.PodTemplateSpec{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:storage-check-pvc-default-init-job,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,ManagedFields:[],},Spec:PodSpec{Volumes:[{data {nil nil nil nil nil nil nil nil nil PersistentVolumeClaimVolumeSource{ClaimName:storage-check-pvc-default,ReadOnly:false,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{storage-check-pvc-default-init-job bitnami/nginx:1.19 [/bin/sh] [-c echo storage-check-ok > /data/index.html & ls -la /data && cat /data/index.html] [] [] [] {map[] map[]} [{data false /data <nil> }] [] nil nil nil IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:nil,ActiveDeadlineSeconds:nil,DNSPolicy:,NodeSelector:map[string]string{},ServiceAccountName:,DeprecatedServiceAccount:,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:nil,ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[],HostAliases:[],PriorityClassName:,Priority:nil,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],RuntimeClassName:nil,EnableServiceLinks:nil,PreemptionPolicy:nil,},},BackoffLimit:nil,TTLSecondsAfterFinished:nil,},Status:JobStatus{Conditions:[],StartTime:<nil>,CompletionTime:<nil>,Active:0,Succeeded:0,Failed:0,},}"
time="2021-05-06T06:57:30Z" level=info msg="Initializing storage in cluster with name: storage-check-pvc-default-init-job"
time="2021-05-06T06:57:30Z" level=info msg="Watching for storage initializer Job to exist."
time="2021-05-06T06:57:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event ADDED"
time="2021-05-06T06:57:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:01:44Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:01:54Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:04:30Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:09:22Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:16:12Z" level=debug msg="Received an event watching for storage changes: storage-check-pvc-default-init-job got event MODIFIED"
time="2021-05-06T07:16:59Z" level=info msg="Cancelling init storage job and shutting down due to interrupt. err:context deadline exceeded"
time="2021-05-06T07:16:59Z" level=error msg="Reporting errors to Kuberhealthy: [failed to initialize storage storage within timeout]"
k logs storage-check-pvc-default-init-job-wfrnz -n synthetic
/bin/sh: 1: cannot create /data/index.html: Permission denied
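For readability, the init Job dumped in the log above boils down to roughly this manifest (hand-reconstructed from the logged spec, not a verbatim dump); note that there is no securityContext anywhere, so the container runs as whatever user the image defaults to:

apiVersion: batch/v1
kind: Job
metadata:
  name: storage-check-pvc-default-init-job
  namespace: synthetic
  labels:
    source: kuberhealthy
spec:
  template:
    spec:
      restartPolicy: Never
      # no securityContext -- the container runs as the image's default user
      containers:
        - name: storage-check-pvc-default-init-job
          image: bitnami/nginx:1.19
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh"]
          args: ["-c", "echo storage-check-ok > /data/index.html & ls -la /data && cat /data/index.html"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: storage-check-pvc-default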
Can we reopen?
Sure...so can I get some additional information on these environments? If it's easier we can also chat in #kuberhealthy too.
Sure; what kind of information could help? I was looking into this, and I noticed that the init job on my cluster doesn't fail for the storage class "azurefile", so maybe there's something related to how Azure disks manage permissions:
> k describe sc managed-premium
> Name: managed-premium
> IsDefaultClass: No
> Annotations: kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{},"labels":{"kubernetes.io/cluster-service":"true"},"name":"managed-premium"},"parameters":{"cachingmode":"ReadOnly","kind":"Managed","storageaccounttype":"Premium_LRS"},"provisioner":"kubernetes.io/azure-disk"}
>
> Provisioner: kubernetes.io/azure-disk
> Parameters: cachingmode=ReadOnly,kind=Managed,storageaccounttype=Premium_LRS
> AllowVolumeExpansion: True
> MountOptions: <none>
> ReclaimPolicy: Delete
> VolumeBindingMode: Immediate
> Events: <none>
>
> k describe sc default
> Name: default
> IsDefaultClass: Yes
> Annotations: kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"},"labels":{"kubernetes.io/cluster-service":"true"},"name":"default"},"parameters":{"cachingmode":"ReadOnly","kind":"Managed","storageaccounttype":"StandardSSD_LRS"},"provisioner":"kubernetes.io/azure-disk"}
> ,storageclass.beta.kubernetes.io/is-default-class=true
> Provisioner: kubernetes.io/azure-disk
> Parameters: cachingmode=ReadOnly,kind=Managed,storageaccounttype=StandardSSD_LRS
> AllowVolumeExpansion: True
> MountOptions: <none>
> ReclaimPolicy: Delete
> VolumeBindingMode: Immediate
> Events: <none>
>
> k describe sc azurefile
> Name: azurefile
> IsDefaultClass: No
> Annotations: kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{},"labels":{"kubernetes.io/cluster-service":"true"},"name":"azurefile"},"parameters":{"skuName":"Standard_LRS"},"provisioner":"kubernetes.io/azure-file"}
>
> Provisioner: kubernetes.io/azure-file
> Parameters: skuName=Standard_LRS
> AllowVolumeExpansion: True
> MountOptions: <none>
> ReclaimPolicy: Delete
> VolumeBindingMode: Immediate
> Events: <none>
So, I dug around and found that the PVC gets mounted under /data/ with the following permissions:
storage-check-pvc-default-custom-job-tspvb:/# ls -la
total 88
drwxr-xr-x 1 root root 4096 May 10 12:59 .
drwxr-xr-x 1 root root 4096 May 10 12:59 ..
drwxr-xr-x 2 root root 4096 Apr 29 17:26 app
drwxr-xr-x 1 root root 4096 May 5 02:46 bin
drwxr-xr-x 3 root root 4096 May 5 02:47 bitnami
drwxr-xr-x 2 root root 4096 Feb 18 11:59 boot
drwxrwxr-x 2 root root 4096 May 5 02:46 certs
drwxrwsr-x 3 root root 4096 May 10 13:03 data
drwxr-xr-x 5 root root 360 May 10 12:59 dev
drwxr-xr-x 1 root root 4096 May 10 12:59 etc
drwxr-xr-x 2 root root 4096 Feb 18 11:59 home
drwxr-xr-x 1 root root 4096 Sep 25 2017 lib
drwxr-xr-x 2 root root 4096 Feb 18 11:59 lib64
drwxr-xr-x 2 root root 4096 Feb 18 11:59 media
drwxr-xr-x 2 root root 4096 Feb 18 11:59 mnt
drwxrwxr-x 1 root root 4096 May 5 02:46 opt
dr-xr-xr-x 508 root root 0 May 10 12:59 proc
drwx------ 2 root root 4096 Feb 18 11:59 root
drwxr-xr-x 1 root root 4096 May 10 12:59 run
drwxr-xr-x 1 root root 4096 May 5 02:46 sbin
drwxr-xr-x 2 root root 4096 Feb 18 11:59 srv
dr-xr-xr-x 12 root root 0 May 10 12:59 sys
drwxrwxrwt 1 root root 4096 May 5 02:46 tmp
drwxrwxr-x 1 root root 4096 May 5 02:46 usr
drwxr-xr-x 1 root root 4096 Feb 18 11:59 var
By default the PVC init jobs don't specify a securityContext. As a result, at least in my environment, they don't have enough permissions to modify files under /data/. I tested this by deploying a custom job which runs as root:
apiVersion: batch/v1
kind: Job
metadata:
  name: storage-check-pvc-default-custom-job
  namespace: synthetic
spec:
  template:
    metadata:
      creationTimestamp: null
      generateName: storage-check-pvc-default-custom-job
      labels:
        job-name: storage-check-pvc-default-custom-job
    spec:
      containers:
      - args:
        - -c
        - sleep 183600
        command:
        - /bin/sh
        image: bitnami/nginx:1.19
        imagePullPolicy: IfNotPresent
        name: storage-check-pvc-defaulti-custom-job
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: data
      securityContext:
        runAsUser: 0
        fsGroup: 0
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: storage-check-pvc-default
and this ran correctly. By entering the container I was able (as expected) to open and create files under /data/. Of course, making the job run as root is a hacky way of making it work. For an arbitrary non-root user to write there, the permissions on the data directory would need to be 777; instead they are drwxrwsr-x, owned by root:root.
So, I tested another custom job which sets the same securityContext as the default specified in your YAML files:
  name: storage-check-pvc-default-custom-job-2
  namespace: synthetic
  resourceVersion: "717789529"
  selfLink: /apis/batch/v1/namespaces/synthetic/jobs/storage-check-pvc-default-custom-job-2
  uid: 5511697b-a92f-4b73-ab2d-fa6f7c9b1542
spec:
  ...
    spec:
      containers:
      - args:
        - -c
        - echo storage-check-ok > /data/index.html && ls -la /data && cat /data/index.html
        command:
        - /bin/sh
        image: bitnami/nginx:1.19
        imagePullPolicy: IfNotPresent
        name: storage-check-pvc-default-custom-job-2
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data
          name: data
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 999
        runAsUser: 999
  ...
status:
  completionTime: "2021-05-10T16:03:37Z"
  conditions:
  - lastProbeTime: "2021-05-10T16:03:37Z"
    lastTransitionTime: "2021-05-10T16:03:37Z"
    status: "True"
    type: Complete
  startTime: "2021-05-10T15:25:09Z"
  succeeded: 1
this actually succeeded:
k get jobs -n synthetic
NAME COMPLETIONS DURATION AGE
storage-check-pvc-default-custom-job-2 1/1 38m 68m
So a solution to this issue would be making sure that the jobs are created with a correct securityContext. However (and correct me if I'm wrong) those jobs are created by the check code at https://github.com/Comcast/kuberhealthy/, not in your repo, so maybe we have to fix something on Comcast's side.
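For reference, a minimal sketch of what such a securityContext could look like on the init job's pod template, reusing the 999/999 values that worked in the custom job above (those values are an assumption borrowed from my test, not the check's official defaults):

spec:
  template:
    spec:
      securityContext:
        runAsUser: 999   # fixed non-root UID; value borrowed from the working custom job above
        fsGroup: 999     # kubelet chowns the mounted volume to this GID and makes it group-writable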
In case someone is still battling with this: the problem is that the PVC init job runs as whatever user the Docker image defaults to, since it does not set a securityContext. In the case of bitnami/nginx:1.19 that is UID 1001. Most PVs come up with root-owned filesystem permissions, which is why you get Permission denied.
The workaround is to use an image whose default user is root, such as nginx:1.25-perl.
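If anyone wants to double-check that on their own cluster, a throwaway pod along these lines (pod name and PVC name are placeholders) mounts a PVC with that image and attempts the same write the check does:

# Hypothetical verification pod: nginx:1.25-perl runs as root by default,
# so the write to the PVC-backed /data should succeed where the check's job failed.
apiVersion: v1
kind: Pod
metadata:
  name: storage-write-test        # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: write-test
      image: nginx:1.25-perl
      command: ["/bin/sh", "-c"]
      args: ["echo storage-check-ok > /data/index.html && cat /data/index.html"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: storage-check-pvc-default   # replace with your PVC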
Hello, I'd like to ask about the storage check job. I'm only getting /bin/sh: 1: cannot create /data/index.html: Permission denied from storage-check-pvc-init-job. I've just tried enabling allowPrivilegeEscalation and it didn't help. Normally we have no problem writing to PVs. I'm thinking about securityContext.readOnlyRootFilesystem, but this switch is quite dangerous for production as it applies globally. Is it possible that this check is not compatible with the old StorageClass we are still using?
What am I missing?
EDIT: nope, not working even with securityContext.readOnlyRootFilesystem=false.