carlosedp / cluster-monitoring

Cluster monitoring stack for clusters based on Prometheus Operator
MIT License

Adding PV results in CreateContainerConfigError and CrashLoopBackOff #70

Closed · Henrik-Wo closed this issue 4 years ago

Henrik-Wo commented 4 years ago

If I apply the project to my cluster with PVs, I get the following issue:

% kubectl get pv
NAME         CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                           STORAGECLASS   REASON   AGE
grafana      20Gi       RWO            Retain           Bound    monitoring/grafana-storage                      manual                  2m40s
prometheus   2Gi        RWO            Retain           Bound    monitoring/prometheus-k8s-db-prometheus-k8s-0   manual                  2m28s

% kubectl get pods -n monitoring -o wide
NAME                                   READY   STATUS                       RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
prometheus-operator-6b8868d698-ssp5f   2/2     Running                      0          72s   10.42.2.83       kube-worker-02   <none>           <none>
arm-exporter-qlp2k                     2/2     Running                      0          60s   10.42.1.73       kube-worker-01   <none>           <none>
arm-exporter-cqszm                     2/2     Running                      0          60s   10.42.2.84       kube-worker-02   <none>           <none>
arm-exporter-sszn6                     2/2     Running                      0          60s   10.42.0.81       kube-master-1    <none>           <none>
alertmanager-main-0                    2/2     Running                      0          61s   10.42.2.85       kube-worker-02   <none>           <none>
node-exporter-tkp6t                    2/2     Running                      0          47s   192.168.78.160   kube-worker-01   <none>           <none>
node-exporter-svpg2                    2/2     Running                      0          47s   192.168.78.150   kube-master-1    <none>           <none>
prometheus-adapter-f78c4f4ff-4rtg9     1/1     Running                      0          43s   10.42.1.74       kube-worker-01   <none>           <none>
kube-state-metrics-96bf99844-k6kll     3/3     Running                      0          47s   10.42.2.87       kube-worker-02   <none>           <none>
node-exporter-hx9vh                    2/2     Running                      0          47s   192.168.78.161   kube-worker-02   <none>           <none>
prometheus-k8s-0                       2/3     CreateContainerConfigError   0          34s   10.42.1.75       kube-worker-01   <none>           <none>
grafana-7466bcc7c5-l24jf               0/1     CrashLoopBackOff             2          48s   10.42.2.86       kube-worker-02   <none>           <none>

If I choose to run without a PV (which I would prefer not to do), the project runs smoothly:

% kubectl get pods -n monitoring -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP               NODE             NOMINATED NODE   READINESS GATES
prometheus-operator-6b8868d698-5gk7v   2/2     Running   0          53s   10.42.2.88       kube-worker-02   <none>           <none>
arm-exporter-sfs9b                     2/2     Running   0          41s   10.42.2.90       kube-worker-02   <none>           <none>
arm-exporter-zsdwm                     2/2     Running   0          41s   10.42.1.76       kube-worker-01   <none>           <none>
arm-exporter-tj6mj                     2/2     Running   0          41s   10.42.0.82       kube-master-1    <none>           <none>
alertmanager-main-0                    2/2     Running   0          42s   10.42.2.89       kube-worker-02   <none>           <none>
node-exporter-vd2p7                    2/2     Running   0          29s   192.168.78.160   kube-worker-01   <none>           <none>
node-exporter-bmljr                    2/2     Running   0          29s   192.168.78.150   kube-master-1    <none>           <none>
kube-state-metrics-96bf99844-jzpqg     3/3     Running   0          30s   10.42.2.91       kube-worker-02   <none>           <none>
node-exporter-td5c6                    2/2     Running   0          29s   192.168.78.161   kube-worker-02   <none>           <none>
prometheus-adapter-f78c4f4ff-xr49j     1/1     Running   0          23s   10.42.1.77       kube-worker-01   <none>           <none>
grafana-7bcf47fbcb-jhl4x               1/1     Running   0          31s   10.42.2.92       kube-worker-02   <none>           <none>
prometheus-k8s-0                       3/3     Running   0          13s   10.42.0.83       kube-master-1    <none>           <none>

Is there a way to solve this problem? Does my PV need to be configured in a certain way to work with Grafana and Prometheus?
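For context, a manually created PV for Grafana could look roughly like the sketch below (the hostPath path is an assumed example; the size, access mode, reclaim policy, storage class and claim mirror the kubectl get pv output above):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  hostPath:
    path: /mnt/usbRAID/monitoring/grafana   # example path, adjust to your storage backend
  claimRef:
    namespace: monitoring
    name: grafana-storage
EOF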

carlosedp commented 4 years ago

Have you created the PVs manually? Do you have the logs from Prometheus and Grafana?
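For example, something like the following pulls those logs (pod names are taken from the listing above; yours may differ):

kubectl logs -n monitoring prometheus-k8s-0 -c prometheus
kubectl logs -n monitoring grafana-7466bcc7c5-l24jf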

carlosedp commented 4 years ago

Check your directory permissions. Grafana creates files with UID:GID 472:472 and Prometheus with 1000:0. For debugging, do a chmod -R 777 [dir].
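On a hostPath-style backend that usually means something like this on the node that holds the data (the paths here are examples; adjust them to wherever your PVs point):

# Grafana runs as UID:GID 472:472, Prometheus as 1000:0
sudo chown -R 472:472 /mnt/usbRAID/monitoring/grafana
sudo chown -R 1000:0 /mnt/usbRAID/monitoring/prometheus-db
# or, for debugging only, open up the permissions completely
sudo chmod -R 777 /mnt/usbRAID/monitoring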

Henrik-Wo commented 4 years ago
  1. Yes, I created the PVs manually. I'm not sure which logs you mean, but I hope these are the right ones:

    % make deploy
    echo "Deploying stack setup manifests..."
    Deploying stack setup manifests...
    kubectl apply -f ./manifests/setup/
    namespace/monitoring created
    customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
    customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
    customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
    customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
    customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
    customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
    clusterrole.rbac.authorization.k8s.io/prometheus-operator created
    clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
    deployment.apps/prometheus-operator created
    service/prometheus-operator created
    serviceaccount/prometheus-operator created
    echo "Will wait 10 seconds to deploy the additional manifests.."
    Will wait 10 seconds to deploy the additional manifests..
    sleep 10
    kubectl apply -f ./manifests/
    alertmanager.monitoring.coreos.com/main created
    secret/alertmanager-main created
    service/alertmanager-main created
    serviceaccount/alertmanager-main created
    servicemonitor.monitoring.coreos.com/alertmanager created
    clusterrole.rbac.authorization.k8s.io/arm-exporter created
    clusterrolebinding.rbac.authorization.k8s.io/arm-exporter created
    daemonset.apps/arm-exporter created
    service/arm-exporter created
    serviceaccount/arm-exporter created
    servicemonitor.monitoring.coreos.com/arm-exporter created
    secret/grafana-config created
    secret/grafana-datasources created
    configmap/grafana-dashboard-apiserver created
    configmap/grafana-dashboard-cluster-total created
    configmap/grafana-dashboard-controller-manager created
    configmap/grafana-dashboard-coredns-dashboard created
    configmap/grafana-dashboard-k8s-resources-cluster created
    configmap/grafana-dashboard-k8s-resources-namespace created
    configmap/grafana-dashboard-k8s-resources-node created
    configmap/grafana-dashboard-k8s-resources-pod created
    configmap/grafana-dashboard-k8s-resources-workload created
    configmap/grafana-dashboard-k8s-resources-workloads-namespace created
    configmap/grafana-dashboard-kubelet created
    configmap/grafana-dashboard-kubernetes-cluster-dashboard created
    configmap/grafana-dashboard-namespace-by-pod created
    configmap/grafana-dashboard-namespace-by-workload created
    configmap/grafana-dashboard-node-cluster-rsrc-use created
    configmap/grafana-dashboard-node-rsrc-use created
    configmap/grafana-dashboard-nodes created
    configmap/grafana-dashboard-persistentvolumesusage created
    configmap/grafana-dashboard-pod-total created
    configmap/grafana-dashboard-prometheus-dashboard created
    configmap/grafana-dashboard-prometheus-remote-write created
    configmap/grafana-dashboard-prometheus created
    configmap/grafana-dashboard-proxy created
    configmap/grafana-dashboard-scheduler created
    configmap/grafana-dashboard-statefulset created
    configmap/grafana-dashboard-traefik-dashboard created
    configmap/grafana-dashboard-workload-total created
    configmap/grafana-dashboards created
    deployment.apps/grafana created
    service/grafana created
    serviceaccount/grafana created
    servicemonitor.monitoring.coreos.com/grafana created
    persistentvolumeclaim/grafana-storage created
    ingress.extensions/alertmanager-main created
    ingress.extensions/grafana created
    ingress.extensions/prometheus-k8s created
    clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
    clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
    deployment.apps/kube-state-metrics created
    service/kube-state-metrics created
    serviceaccount/kube-state-metrics created
    servicemonitor.monitoring.coreos.com/kube-state-metrics created
    clusterrole.rbac.authorization.k8s.io/node-exporter created
    clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
    daemonset.apps/node-exporter created
    service/node-exporter created
    serviceaccount/node-exporter created
    servicemonitor.monitoring.coreos.com/node-exporter created
    apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
    clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
    clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
    clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
    clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
    clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
    configmap/adapter-config created
    deployment.apps/prometheus-adapter created
    rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
    service/prometheus-adapter created
    serviceaccount/prometheus-adapter created
    clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
    clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
    endpoints/kube-controller-manager-prometheus-discovery created
    service/kube-controller-manager-prometheus-discovery created
    service/kube-dns-prometheus-discovery created
    endpoints/kube-scheduler-prometheus-discovery created
    service/kube-scheduler-prometheus-discovery created
    servicemonitor.monitoring.coreos.com/prometheus-operator created
    prometheus.monitoring.coreos.com/k8s created
    rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
    rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
    rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
    rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
    role.rbac.authorization.k8s.io/prometheus-k8s-config created
    role.rbac.authorization.k8s.io/prometheus-k8s created
    role.rbac.authorization.k8s.io/prometheus-k8s created
    role.rbac.authorization.k8s.io/prometheus-k8s created
    prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
    service/prometheus-k8s created
    serviceaccount/prometheus-k8s created
    servicemonitor.monitoring.coreos.com/prometheus created
    servicemonitor.monitoring.coreos.com/kube-apiserver created
    servicemonitor.monitoring.coreos.com/coredns created
    servicemonitor.monitoring.coreos.com/kube-controller-manager created
    servicemonitor.monitoring.coreos.com/kube-scheduler created
    servicemonitor.monitoring.coreos.com/kubelet created
    servicemonitor.monitoring.coreos.com/traefik created
  2. I found out that Grafana has not created a directory at all, and the permissions for Prometheus look like this:

    $ ls -ld /mnt/usbRAID/monitoring/prometheus-db
    drwxr-xr-x 2 root root 4096 Jun 23 18:35 /mnt/usbRAID/monitoring/prometheus-db

Hope this helps narrow down the problem; I'm still new to k8s.

Henrik-Wo commented 4 years ago

Did a chmod -R 777 [dir] on the prometheus directory.

ls -ld /mnt/usbRAID/monitoring/prometheus-db
drwxrwxrwx 2 root root 4096 Jun 23 18:35 /mnt/usbRAID/monitoring/prometheus-db

Afterwards I ran make deploy again, but the result is still the same.

carlosedp commented 4 years ago

Check your dir permissions, then deploy the stack. If it's already running, restart the pods.
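One quick way to do that is to delete the pods and let their controllers recreate them (pod names are taken from the earlier listing; yours will differ after a redeploy):

kubectl delete pod -n monitoring prometheus-k8s-0
kubectl delete pod -n monitoring grafana-7466bcc7c5-l24jf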

Henrik-Wo commented 4 years ago

OK, I fixed the permissions and Grafana is now running. The PVs also appear to be correctly connected. But the Prometheus pod is still in CrashLoopBackOff.

The log is as follows:

% kubectl logs prometheus-k8s-0 -n monitoring -c prometheus
level=info ts=2020-07-27T16:21:49.449Z caller=main.go:337 msg="Starting Prometheus" version="(version=2.19.1, branch=HEAD, revision=eba3fdcbf0d378b66600281903e3aab515732b39)"
level=info ts=2020-07-27T16:21:49.449Z caller=main.go:338 build_context="(go=go1.14.4, user=root@62700b3d0ef9, date=20200618-17:44:42)"
level=info ts=2020-07-27T16:21:49.449Z caller=main.go:339 host_details="(Linux 4.19.118-v7l+ #1311 SMP Mon Apr 27 14:26:42 BST 2020 armv7l prometheus-k8s-0 (none))"
level=info ts=2020-07-27T16:21:49.449Z caller=main.go:340 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-07-27T16:21:49.449Z caller=main.go:341 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2020-07-27T16:21:49.451Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0xbeb3c891, 0xb, 0x14, 0x23e0968, 0x4d331a0, 0x23e0968)
    /app/promql/query_logger.go:117 +0x2fc
main.main()
    /app/cmd/prometheus/main.go:368 +0x42dc
% kubectl logs -n monitoring prometheus-k8s-0 -c prometheus-config-reloader
ts=2020-07-28T10:01:43.630061122Z caller=main.go:87 msg="Starting prometheus-config-reloader version ''."
level=error ts=2020-07-28T10:01:43.656100016Z caller=runutil.go:98 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp 127.0.0.1:9090: connect: connection refused"
% kubectl logs -n monitoring prometheus-k8s-0 -c rules-configmap-reloader  
2020/07/28 10:01:45 Watching directory: "/etc/prometheus/rules/prometheus-k8s-rulefiles-0"
carlosedp commented 4 years ago

Apparently you have permission errors on your PVs. Check the backend that provides the storage, since this behavior depends on it.

Henrik-Wo commented 4 years ago

OK, I found a workaround that fixes my problem! But I still think something is going wrong during the deployment of the stack.

  1. I created the directories for Prometheus and Grafana in the backend
    $ sudo mkdir <dir>/monitoring/prometheus
    $ sudo mkdir <dir>/monitoring/grafana
  2. Changed the permissions for both directories
    $ sudo chown -R 1000:0 <dir>/monitoring/prometheus/
    $ sudo chown -R 472:472 <dir>/monitoring/grafana/
  3. Made the directories available via PVs and deployed the whole stack.

Grafana is now running but Prometheus is stuck in CrashLoopBackOff.

  4. Went back to the backend and changed the permissions for Prometheus again
    $ sudo chown -R 1000:0 <dir>/monitoring/prometheus/

During the next restart of the Prometheus pod, it picks up the new permissions and starts running.

carlosedp commented 4 years ago

Must be something on the backend. I've tested with some NFS and K3s local storage and it works fine.

Gonna take another look soon.

aneeldadani commented 4 years ago

I ran into the same issue. I updated the filesystem permissions with sudo chmod -R 777 [dir], which allowed the pods to reach a "Running" state. What are the recommended permissions?

carlosedp commented 4 years ago

It's recommended that the cluster have full access to the mount point so it can manage (read, write and execute) its files.

carlosedp commented 4 years ago

Closing this, as the problem is related to backend permissions.