Also, creating a PV manually and binding it to the PVC works, but it's not optimal.
Ah sorry, I didn't see that you replied to the old issue. Is the Ceph cluster in a healthy state?
The easiest way to check is probably through the dashboard.
Might be some relevant troubleshooting tips here: https://github.com/rook/rook/issues/7756
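If you have the Rook toolbox deployed, the same check works from there (assuming the standard rook-ceph-tools deployment):

# run ceph status from the toolbox pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status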
OK, I'll go through the materials and let you know how it went. Thanks!
@MathiasPius So I got the CephCluster to run and reach a healthy state. I'm using pretty much the same configs as you, but the OSD pods are not deployed; they are not scheduled at all. The CephCluster object also shows the following warning:

User "system:serviceaccount:rook-ceph:rook-ceph-system" cannot create resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "rook-ceph"

I added a ClusterRole and bound it to the named ServiceAccount, but this didn't fix the issue (I don't know whether the two behaviors are connected).
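Roughly what I applied (the resource names here are mine, purely illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rook-ceph-servicemonitors   # illustrative name
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["servicemonitors"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rook-ceph-servicemonitors   # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rook-ceph-servicemonitors
subjects:
  - kind: ServiceAccount
    name: rook-ceph-system
    namespace: rook-ceph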
@MathiasPius Also, I ended up using the Helm chart to install the operator, since I was getting these errors:

helmrepository: Warning Failed 64s source-controller failed to fetch Helm repository index: failed to cache index to temporary file: Get "https://charts.rook.io/release/index.yaml": dial tcp 18.239.69.97:443: i/o timeout

helm-release: HelmChart 'rook-ceph/rook-ceph-rook-ceph' is not ready: latest generation of object has not been reconciled

Here's my .yaml:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: rook-release
  namespace: rook-ceph
spec:
  interval: 5m0s
  url: https://charts.rook.io/release
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  interval: 5m
  chart:
    spec:
      chart: rook-ceph
      version: ">=v1.13.0 <v1.14.0"
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: rook-ceph
      interval: 1m
  values:
    crds:
      enabled: true
    enableDiscoveryDaemon: true
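For what it's worth, once the network timeout is resolved, the objects can be re-reconciled with the standard flux CLI commands:

flux reconcile source helm rook-release -n rook-ceph   # force a fresh index fetch
flux reconcile helmrelease rook-ceph -n rook-ceph      # retry the chart install
flux get helmreleases -n rook-ceph                     # check the resulting status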
@MathiasPius
kg cephcluster -n rook-ceph
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
its-rook-ceph-cluster /var/lib/rook 3 25m Ready Cluster created successfully HEALTH_WARN 1afd497a
bash-4.4$ ceph status
  cluster:
    id:     xxxxxxxxxx
    health: HEALTH_WARN
            29 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum a,b,c (age 27m)
    mgr: b(active, since 27m), standbys: a
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
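As an aside, the "29 daemons have recently crashed" warning can be inspected from the toolbox with the standard crash commands:

ceph crash ls            # list recent crashes
ceph crash info <id>     # details for a specific crash
ceph crash archive-all   # acknowledge them all, clearing the warning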
What do the logs of the discovery pods say? Rook should spawn discovery pods on each node to determine which of the drives are available for use as OSDs.
What are the discovery pods' names? I'm not seeing any 🤔
In my cluster they're called:
rook-ceph-osd-prepare-n1-db27h 0/1 Completed 0 7d22h
rook-ceph-osd-prepare-n2-tfmvl 0/1 Completed 0 7d22h
rook-ceph-osd-prepare-n3-4kxjk 0/1 Completed 0 7d22h
Where the n1-3 are the node names.
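They're created by Jobs, so the pods may already have been cleaned up if they ran a while ago; you can also select on Rook's label directly (assuming the default labels):

kubectl -n rook-ceph get jobs,pods -l app=rook-ceph-osd-prepare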
The servicemonitors warning seems unrelated to the issue of provisioning OSDs, but it looks like the issue is mentioned here, pointing to this documentation.
I don't have those osd-prepare pods running in my cluster. That's what I don't understand 🫤
Try grepping for prepare in the operator pod's logs. In mine it shows:
2024-04-03 20:28:14.360120 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n1 to start a new one
2024-04-03 20:28:14.375938 I | op-k8sutil: batch job rook-ceph-osd-prepare-n1 still exists
2024-04-03 20:28:17.381855 I | op-k8sutil: batch job rook-ceph-osd-prepare-n1 deleted
2024-04-03 20:28:17.563265 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n2 to start a new one
2024-04-03 20:28:17.583691 I | op-k8sutil: batch job rook-ceph-osd-prepare-n2 still exists
2024-04-03 20:28:20.587999 I | op-k8sutil: batch job rook-ceph-osd-prepare-n2 deleted
2024-04-03 20:28:21.355654 I | op-k8sutil: Removing previous job rook-ceph-osd-prepare-n3 to start a new one
2024-04-03 20:28:21.368702 I | op-k8sutil: batch job rook-ceph-osd-prepare-n3 still exists
2024-04-03 20:28:24.374258 I | op-k8sutil: batch job rook-ceph-osd-prepare-n3 deleted
~$ k logs rook-ceph-operator-54d8747b6d-67fxt -n rook-ceph | grep prepare
~$
@MathiasPius No lines about prepare; somehow the operator doesn't even try to deploy OSDs, as if I had disabled it... Here's my CephCluster:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: my-rook-ceph-cluster
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.2
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    modules:
      - name: rook
        enabled: true
    count: 2
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  removeOSDsIfOutAndSafeToRemove: false
  storage:
    config:
      onlyApplyOSDPlacement: true
    useAllDevices: true
    useAllNodes: false
    nodes:
      - name: "vm-1"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"
      - name: "vm-2"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"
      - name: "vm-3"
        resources:
          limits:
            cpu: "2"
            memory: "4096Mi"
          requests:
            cpu: "1"
            memory: "2048Mi"
  placement:
    all:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
  cephConfig:
    global:
      osd_pool_default_size: "3"
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
  monitoring:
    enabled: true
    metricsDisabled: false
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
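Since the Helm values set enableDiscoveryDaemon: true, it might also be worth confirming that the discovery DaemonSet actually exists (the default Rook names are assumed here):

kubectl -n rook-ceph get daemonset rook-discover     # should exist when the discovery daemon is enabled
kubectl -n rook-ceph get pods -l app=rook-discover   # one discovery pod per storage node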
$ k logs rook-ceph-operator-54d8747b6d-67fxt -n rook-ceph | grep osd
2024-04-11 17:37:46.241116 I | op-config: failed to set keys [osd_pool_default_size], trying to remove them first
2024-04-11 17:37:46.241151 I | op-config: deleting "global" "osd_pool_default_size" option from the mon configuration database
2024-04-11 17:37:46.662746 I | op-config: successfully deleted "osd_pool_default_size" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:09.665143 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:10.107960 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:31.951372 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:32.369636 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:38:52.385796 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:38:52.832277 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:16.133398 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:16.559750 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:36.495297 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:36.912677 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:39:57.246863 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:39:57.677154 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:40:17.895025 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:40:18.320489 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:40:40.077292 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:40:40.505808 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:01.438973 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:01.873401 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:23.804441 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:24.216425 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:41:48.934122 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:41:49.353212 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:42:18.776572 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:42:19.187305 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:42:59.165994 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:42:59.599509 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:44:00.589143 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:44:01.018028 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:45:43.365561 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:45:43.791873 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:48:49.702683 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:48:50.104521 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:49:02.218511 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
2024-04-11 17:49:02.826564 I | op-osd: ceph osd status in namespace "rook-ceph" check interval "1m0s"
2024-04-11 17:49:02.826568 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "rook-ceph"
2024-04-11 17:51:35.030254 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
2024-04-11 17:51:35.442707 I | op-osd: ceph osd status in namespace "rook-ceph" check interval "1m0s"
2024-04-11 17:51:35.442790 I | ceph-cluster-controller: enabling ceph osd monitoring goroutine for cluster "rook-ceph"
2024-04-11 17:52:24.903131 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:52:25.320156 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:54:12.677874 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:54:13.075430 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:54:44.862607 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:54:45.301929 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:55:17.854259 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:55:18.289486 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:55:49.797489 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:55:50.200859 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:56:22.757595 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:56:23.220163 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:56:55.354525 I | op-config: deleting "global" "ms_osd_compress_mode" option from the mon configuration database
2024-04-11 17:56:55.776590 I | op-config: successfully deleted "ms_osd_compress_mode" option from the mon configuration database
osd_pool_default_size = 3
2024-04-11 17:57:05.803464 I | op-osd: stopping monitoring of OSDs in namespace "rook-ceph"
- LivenessProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
- StartupProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
+ LivenessProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
+ StartupProbe: map[v1.KeyType]*v1.ProbeSpec{"mgr": &{}, "mon": &{}, "osd": &{}},
That is odd. I'm assuming the VMs have extra unformatted storage attached besides their root disk?
https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph/values.yaml#L536 am I missing something here?
In that case there shouldn't be an issue, but you can make sure using the talosctl disks command against one of your nodes. It should list the mounted disks.
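For example (the node address is a placeholder):

talosctl -n 10.5.0.2 disks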
Honestly I'm not entirely sure why these nodes are not picked up for storage, I think it's more of a Rook-specific issue. I'd see if you can't find someone with a little more Rook experience in their slack: https://slack.rook.io/
NODE DEV MODEL SERIAL TYPE UUID WWID MODALIAS NAME SIZE BUS_PATH SUBSYSTEM READ_ONLY SYSTEM_DISK
xxxxx /dev/sda Virtual Disk - HDD - naa.xxx scsi:t-0x00 - 136 GB /LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/8eb8833b-45b9/host0/target0:0:0/0:0:0:0/ /sys/class/block *
Try attaching more virtual hard disks, since Ceph needs (at least) one entire disk to set up an OSD.
Edit: I think it is possible to point Ceph at a single partition or LVM partition, but it's easier to use an entire disk.
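If discovery still skips the new disks, you could also try naming them explicitly in the CephCluster spec instead of relying on useAllDevices (the device name below is illustrative):

# under spec.storage.nodes in the CephCluster
nodes:
  - name: "vm-1"
    devices:
      - name: "sdb"   # whatever talosctl disks reports for the empty disk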
Adding more disks didn't help: still no osd or prepare lines found in the operator logs.
When I create a PVC it stays Pending. Looking at the logs of the provisioner shows:
pvc-68745313-91b4-4c06-8f22-c8c95faa4833 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-68745313-91b4-4c06-8f22-c8c95faa4833 already exists
But the PVC clearly did not exist. @MathiasPius any ideas where the error can lie?
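A note on that last error: ceph-csi reports "an operation with the given Volume ID already exists" when an earlier CreateVolume call for the same volume is still in flight, so it does not mean the PVC already exists; with zero OSDs the first attempt can never complete. The RBD provisioner logs should confirm (standard Rook deployment names assumed):

kubectl -n rook-ceph get pods -l app=csi-rbdplugin-provisioner
kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-provisioner --tail=50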