@acziryak not sure how you ended up in this state. Do you have a clear reproducer and complete logs so we can understand what went wrong?
I can rebuild the Nomad cluster, but cannot rebuild the Ceph cluster. However, I can create a new pool/client for the ceph cluster.
Is there a way that I can manually test retrieving the omap values?
See if https://www.mrajanna.com/tracking-pv-rados-omap-in-cephcsi/ helps you track the omap values. Some of the commands are Kubernetes-specific; I think you can replace them with the Nomad equivalents.
So I tried to create a new volume, which succeeded:
I0217 17:24:19.174991 7 omap.go:88] ID: 78467 Req-ID: agnostic2-us-ind-test got omap values: (pool="ind-nonprod2", namespace="", name="csi.volume.9f0760b0-aee6-11ed-96f6-ae765b511692"): map[csi.imageid:208269a29eef81 csi.imagename:csi-vol-9f0760b0-aee6-11ed-96f6-ae765b511692 csi.volname:agnostic2-us-ind-test]
I0217 17:24:19.216688 7 rbd_journal.go:337] ID: 78467 Req-ID: agnostic2-us-ind-test found existing volume (0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-9f0760b0-aee6-11ed-96f6-ae765b511692) with image name (csi-vol-9f0760b0-aee6-11ed-96f6-ae765b511692) for request (agnostic2-us-ind-test)
I0217 17:24:19.217186 7 utils.go:212] ID: 78467 Req-ID: agnostic2-us-ind-test GRPC response: {"volume":{"capacity_bytes":10737418240,"volume_context":{"clusterID":"1e35f6bc-1257-45b6-aa9d-16f9ecd30652","imageFeatures":"layering","imageName":"csi-vol-9f0760b0-aee6-11ed-96f6-ae765b511692","journalPool":"ind-nonprod2","pool":"ind-nonprod2"},"volume_id":"0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-9f0760b0-aee6-11ed-96f6-ae765b511692"}}
However, I cannot find the omap keys for that volume on my monitor node:
rados listomapkeys csi.volume.9f0760b0-aee6-11ed-96f6-ae765b511692
usage: rados [options] [commands]
This is per the troubleshooting page linked.
EDIT: It looks like it requires the pool name:
rados -p ind-nonprod2 listomapkeys csi.volume.9f0760b0-aee6-11ed-96f6-ae765b511692
csi.imageid
csi.imagename
csi.volname
Yes, it requires the pool name. You can also list the values with listomapvals.
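For anyone repeating this check, here is a minimal sketch of how the object name can be derived. It assumes, based only on the logs in this thread, that the volume ID ends in the same 36-character UUID used in the csi.volume.<uuid> object name. The helper name omap_object_for_volume_id is made up for illustration and is not part of ceph-csi. The script only prints the rados commands so they can be reviewed before being run on a node with Ceph client access.

```shell
#!/bin/sh
# Sketch: derive the omap object name from a ceph-csi volume ID and print
# the rados commands that inspect it. Assumption (taken from the logs in
# this thread): the volume ID ends in the 36-character UUID that also
# appears in the "csi.volume.<uuid>" object name.

# Hypothetical helper, not part of ceph-csi.
omap_object_for_volume_id() {
    # Keep only the last 36 characters (the UUID) of the volume ID.
    uuid=$(printf '%s' "$1" | tail -c 36)
    printf 'csi.volume.%s\n' "$uuid"
}

POOL="ind-nonprod2"
VOLUME_ID="0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-9f0760b0-aee6-11ed-96f6-ae765b511692"
OBJ=$(omap_object_for_volume_id "$VOLUME_ID")

# Print rather than execute, so the commands can be checked first.
echo "rados -p $POOL listomapkeys $OBJ"
echo "rados -p $POOL listomapvals $OBJ"
```

Running the two printed commands should show the csi.imageid, csi.imagename and csi.volname keys and their values, which can then be compared against what the plugin logs report.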
If no problem exists, can we close this one?
Yes.
I will note here that after I ran nomad system gc and nomad system reconcile summaries, followed by a full stop and start of the CSI nodes and controllers, I was able to have the volume created. So whatever that inconsistent state error was, it must have been something Nomad was caching somewhere that a garbage collection and a restart of the jobs cleared. Hope that helps whoever finds their way here later. And thank you for your willingness to help, @Madhu-1.
This is happening again, and the above commands aren't fixing it.
I wonder if there's some cache somewhere that's not being cleared.
Describe the bug
Creating a Nomad volume with ceph-csi results in a "state inconsistent, omap names mismatch" error: https://github.com/ceph/ceph-csi/blob/devel/internal/journal/voljournal.go#L369-L375
Environment details
Mounter used (fuse or kernel; for rbd, krbd or rbd-nbd): krbd
Steps to reproduce
Steps to reproduce the behavior:
nomad volume create
At this point I'm trying to figure out where in the stack the problem lies: whether it's a user permission problem, Ceph is corrupted, or Nomad is having issues.
FWIW, I see this at the beginning of the node's logs:
I'm not sure if that's related or not, especially because I see the volume get created on the Ceph cluster. However, I can't seem to get the omap values from that volume:
The comment by the error that I'm seeing is not promising: