MathiasPius / kronform

Public configuration for a Kubernetes cluster hosted with Hetzner.
https://datavirke.dk/posts/bare-metal-kubernetes-part-1-talos-on-hetzner/
MIT License

PVC creation errors #3

Closed. MichaelKora closed this issue 3 months ago.

MichaelKora commented 5 months ago

When I create a PVC it stays pending. Looking in the logs of the provisioner shows:

pvc-68745313-91b4-4c06-8f22-c8c95faa4833 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-68745313-91b4-4c06-8f22-c8c95faa4833 already exists

But the PVC clearly did not exist. @MathiasPius any ideas where the error could lie?

MichaelKora commented 5 months ago

Also, creating a PV manually and binding it to the PVC works, but that's not optimal.

MathiasPius commented 5 months ago

Ah, sorry, I didn't see you had replied to the old issue. Is the Ceph cluster in a healthy state?

Easiest way to check is probably through the dashboard

Might be some relevant troubleshooting tips here: https://github.com/rook/rook/issues/7756
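
If you don't have the dashboard handy, the same check can be done from the toolbox pod. A sketch, assuming the rook-ceph-tools deployment from the Rook examples is installed in the rook-ceph namespace:

# Ask Ceph directly from the toolbox pod (must be deployed separately).
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail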

MichaelKora commented 5 months ago

OK, I'll go through the material and let you know how it goes. Thanks.

MichaelKora commented 5 months ago

@MathiasPius So I got the CephCluster to run and reach a healthy state. I am using pretty much the same configs as you, but the OSD pods are not deployed; they are not scheduled at all. The CephCluster object also shows the following warning: User "system:serviceaccount:rook-ceph:rook-ceph-system" cannot create resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "rook-ceph". I added a ClusterRole and bound it to the named ServiceAccount, but this didn't fix the issue (I don't know if the two behaviors are connected).
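
For reference, a minimal sketch of the kind of grant that warning asks for might look like the following; the ClusterRole/ClusterRoleBinding names are made up, and this may well not be the right fix:

# Illustrative only: allow the operator's service account to manage ServiceMonitors.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rook-ceph-servicemonitors   # hypothetical name
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["servicemonitors"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rook-ceph-servicemonitors   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-servicemonitors
subjects:
  - kind: ServiceAccount
    name: rook-ceph-system
    namespace: rook-ceph
EOF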

MichaelKora commented 5 months ago

@MathiasPius Also, I ended up using the Helm chart to install the operator, since I was getting these errors:

helmrepository: Warning Failed 64s source-controller failed to fetch Helm repository index: failed to cache index to temporary file: Get "https://charts.rook.io/release/index.yaml": dial tcp 18.239.69.97:443: i/o timeout

helm-release: HelmChart 'rook-ceph/rook-ceph-rook-ceph' is not ready: latest generation of object has not been reconciled
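
In case it helps, the usual way to poke at this from the Flux side is roughly the following, assuming the HelmRepository/HelmRelease names from the manifest below:

# See what source-controller and helm-controller currently report.
flux get sources helm -n rook-ceph
flux get helmreleases -n rook-ceph
# Force a fresh fetch of the repository index and a new reconcile attempt.
flux reconcile source helm rook-release -n rook-ceph
flux reconcile helmrelease rook-ceph -n rook-ceph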

MichaelKora commented 5 months ago

Here is my .yaml:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: rook-release
  namespace: rook-ceph
spec:
  interval: 5m0s
  url: https://charts.rook.io/release
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  interval: 5m
  chart:
    spec:
      chart: rook-ceph
      version: ">=v1.13.0 <v1.14.0"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: rook-ceph
      interval: 1m
  values:
    crds:
      enabled: true
    enableDiscoveryDaemon: true
MichaelKora commented 5 months ago

@MathiasPius

kg cephcluster -n rook-ceph
NAME                    DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH        EXTERNAL   FSID
its-rook-ceph-cluster   /var/lib/rook     3          25m   Ready   Cluster created successfully   HEALTH_WARN              1afd497a
MichaelKora commented 5 months ago
bash-4.4$ ceph status
  cluster:
    id:     xxxxxxxxxx
    health: HEALTH_WARN
            29 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum a,b,c (age 27m)
    mgr: b(active, since 27m), standbys: a
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
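
The "29 daemons have recently crashed" part can be inspected, and then archived so it stops tripping HEALTH_WARN, from the same toolbox shell; a sketch, with the crash id as a placeholder:

# List recent daemon crashes and read one in detail.
ceph crash ls
ceph crash info <crash-id>   # id taken from the list above
# Once reviewed, archive them so the health warning clears.
ceph crash archive-all
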
MathiasPius commented 5 months ago

What do the logs of the discovery pods say? Rook should spawn discovery pods on each node to determine which of the drives are available for use as OSDs.
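
Something like this should show whether any of them ever ran; the label and daemonset names are the Rook defaults as far as I recall, so treat it as a sketch:

# The prepare jobs/pods carry the app=rook-ceph-osd-prepare label.
kubectl -n rook-ceph get jobs,pods -l app=rook-ceph-osd-prepare
# With enableDiscoveryDaemon on, there should also be one discover pod per node.
kubectl -n rook-ceph get daemonset rook-discover
kubectl -n rook-ceph logs -l app=rook-discover --tail=50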

MichaelKora commented 5 months ago

What's the discovery pod's name? I'm not seeing any 🤔

MathiasPius commented 5 months ago

In my cluster they're called:

rook-ceph-osd-prepare-n1-db27h                  0/1     Completed   0             7d22h
rook-ceph-osd-prepare-n2-tfmvl                  0/1     Completed   0             7d22h
rook-ceph-osd-prepare-n3-4kxjk                  0/1     Completed   0             7d22h

Where the n1-3 are the node names.

MathiasPius commented 5 months ago

> @MathiasPius So I got the CephCluster to run and reach a healthy state. I am using pretty much the same configs as you, but the OSD pods are not deployed; they are not scheduled at all. The CephCluster object also shows the following warning: User "system:serviceaccount:rook-ceph:rook-ceph-system" cannot create resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "rook-ceph". I added a ClusterRole and bound it to the named ServiceAccount, but this didn't fix the issue (I don't know if the two behaviors are connected).

This seems unrelated to the issue of provisioning OSDs, but it looks like the issue is mentioned here, pointing to this documentation.
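
If I remember right, the operator Helm chart only creates the ServiceMonitor RBAC when its own monitoring flag is enabled, and the CephCluster monitoring section also needs the Prometheus Operator CRDs to exist. A quick way to check both, as a sketch rather than a confirmed fix:

# Is the ServiceMonitor CRD (shipped by the Prometheus Operator) installed at all?
kubectl get crd servicemonitors.monitoring.coreos.com
# What is the operator's service account actually allowed to do?
kubectl auth can-i create servicemonitors.monitoring.coreos.com -n rook-ceph \
  --as=system:serviceaccount:rook-ceph:rook-ceph-system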

MichaelKora commented 5 months ago

> In my cluster they're called:
>
> rook-ceph-osd-prepare-n1-db27h                  0/1     Completed   0             7d22h
> rook-ceph-osd-prepare-n2-tfmvl                  0/1     Completed   0             7d22h
> rook-ceph-osd-prepare-n3-4kxjk                  0/1     Completed   0             7d22h
>
> Where the n1-3 are the node names.

I don't have them running in my cluster. That's what I don't understand 🫤

MathiasPius commented 5 months ago

Try grepping for prepare in the operator pod's logs. In mine it shows:

2024-04-03 20:28:14.360120 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n1 to start a new one
2024-04-03 20:28:14.375938 I | op-k8sutil: batch job rook-ceph-osd-prepare-n1 still exists
2024-04-03 20:28:17.381855 I | op-k8sutil: batch job rook-ceph-osd-prepare-n1 deleted
2024-04-03 20:28:17.563265 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n2 to start a new one
2024-04-03 20:28:17.583691 I | op-k8sutil: batch job rook-ceph-osd-prepare-n2 still exists
2024-04-03 20:28:20.587999 I | op-k8sutil: batch job rook-ceph-osd-prepare-n2 deleted
2024-04-03 20:28:21.355654 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n3 to start a new one
2024-04-03 20:28:21.368702 I | op-k8sutil: batch job rook-ceph-osd-prepare-n3 still exists
2024-04-03 20:28:24.374258 I | op-k8sutil: batch job rook-ceph-osd-prepare-n3 deleted
MichaelKora commented 5 months ago
~$ k logs rook-ceph-operator-54d8747b6d-67fxt -n rook-ceph | grep prepare

~$

@MathiasPius No lines about prepare. Somehow the operator doesn't even try to deploy OSDs, as if I had disabled it...
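
A couple of things that might narrow it down; the ConfigMap name and setting are the Rook defaults as far as I know, so take this as a sketch:

# Turn up operator logging, restart it, and watch a full reconcile for the storage section.
kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'
kubectl -n rook-ceph rollout restart deploy/rook-ceph-operator
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f | grep -iE 'osd|prepare|storage'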

MichaelKora commented 5 months ago
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: my-rook-ceph-cluster
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    modules:
      - name: rook
        enabled: true
    count: 2
    allowMultiplePerNode: false
  dashboard:
    enabled: true

  removeOSDsIfOutAndSafeToRemove: false
  storage:
    config:
    onlyApplyOSDPlacement: true
    useAllDevices: true
    useAllNodes: false
    nodes:
      - name: "vm-1"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"

      - name: "vm-2"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"

      - name: "vm-3"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"
  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
  cephConfig:
    global:
      osd_pool_default_size: "3"
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
  monitoring:
    enabled: true
    metricsDisabled: false
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
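
One thing worth double-checking with useAllNodes: false is that the names under storage.nodes match the Kubernetes node names exactly; as far as I know, Rook simply skips node entries it cannot match to a node, without scheduling any prepare jobs. A quick comparison:

# Compare the node names Kubernetes reports with the ones in the CephCluster spec.
kubectl get nodes -o name
kubectl -n rook-ceph get cephcluster my-rook-ceph-cluster -o jsonpath='{.spec.storage.nodes[*].name}'
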
MichaelKora commented 5 months ago
$ k logs rook-ceph-operator-54d8747b6d-67fxt -n rook-ceph | grep osd

2024-04-11 17:37:46.241116 I | op-config: failed to set keys [osd_pool_default_size], trying to remove them first
2024-04-11 17:37:46.241151 I | op-config: deleting "global" "osd_pool_default_size" option from the mon configuration database
2024-04-11 17:37:46.662746 I | op-config: successfully deleted "osd_pool_default_size" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:09.665143 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:10.107960 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:31.951372 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:32.369636 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:52.385796 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:52.832277 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:16.133398 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:16.559750 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:36.495297 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:36.912677 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:57.246863 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:57.677154 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:40:17.895025 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:40:18.320489 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:40:40.077292 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:40:40.505808 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:01.438973 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:01.873401 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:23.804441 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:24.216425 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:48.934122 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:49.353212 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:42:18.776572 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:42:19.187305 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:42:59.165994 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:42:59.599509 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:44:00.589143 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:44:01.018028 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:45:43.365561 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:45:43.791873 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:48:49.702683 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:48:50.104521 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:49:02.218511 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
2024-04-11 17:49:02.826564 I | op-osd: ceph osd status in namespace "rook-ceph" check interval "1m0s"
2024-04-11 17:49:02.826568 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "rook-ceph"
2024-04-11 17:51:35.030254 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
2024-04-11 17:51:35.442707 I | op-osd: ceph osd status in namespace "rook-ceph" check interval "1m0s"
2024-04-11 17:51:35.442790 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "rook-ceph"
2024-04-11 17:52:24.903131 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:52:25.320156 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:54:12.677874 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:54:13.075430 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:54:44.862607 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:54:45.301929 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:55:17.854259 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:55:18.289486 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:55:49.797489 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:55:50.200859 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:56:22.757595 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:56:23.220163 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:56:55.354525 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:56:55.776590 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:57:05.803464 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
-               LivenessProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
-               StartupProbe:  map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
+               LivenessProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
+               StartupProbe:  map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
MathiasPius commented 5 months ago

That is odd. I'm assuming the VMs have extra unformatted storage besides their root disk attached?

MichaelKora commented 5 months ago

https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L536 Am I missing something here?

MathiasPius commented 5 months ago

In that case there shouldn't be an issue, but you can make sure using the talosctl disks command against one of your nodes. It should list the mounted disks.

Honestly, I'm not entirely sure why these nodes are not picked up for storage; I think it's more of a Rook-specific issue. I'd see if you can't find someone with a little more Rook experience in their Slack: https://slack.rook.io/

MichaelKora commented 5 months ago
NODE           DEV        MODEL          SERIAL   TYPE   UUID   WWID                                   MODALIAS      NAME   SIZE     BUS_PATH                                                                                                        SUBSYSTEM          READ_ONLY   SYSTEM_DISK
xxxxx   /dev/sda   Virtual Disk   -        HDD    -      naa.xxx   scsi:t-0x00   -      136 GB   /LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/8eb8833b-45b9/host0/target0:0:0/0:0:0:0/   /sys/class/block               *
MathiasPius commented 5 months ago

Try attaching more virtual hard disks, since Ceph needs (at least) one entire disk to set up an OSD.

Edit: I think it is possible to point Ceph at a single partition or LVM partition, but it's easier to use an entire disk.
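
If the automatic pickup still doesn't bite once a blank disk is attached, the node entries can point at the device explicitly instead of relying on useAllDevices. A fragment only, and the device name sdb is an assumption about how the new disk shows up:

# Fragment of spec.storage for the CephCluster (e.g. via
# kubectl -n rook-ceph edit cephcluster my-rook-ceph-cluster), not a full manifest:
#
#   nodes:
#     - name: "vm-1"
#       devices:
#         - name: "sdb"   # the newly attached, unpartitioned disk
#
# And confirm from Talos that the disk really is blank before Rook looks at it:
talosctl -n <node-ip> disks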

MichaelKora commented 5 months ago

Adding more disks didn't help... still no osd or prepare lines found in the operator logs.