ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Qualify or move RBD NBD to full support #667

Open humblec opened 5 years ago

humblec commented 5 years ago

Even though we have support for rbd-nbd in CSI, it is not the default and has not been qualified much. This is a feature request to bring more qualification to rbd-nbd and make it fully supported with RBD CSI. The main issue we have come across with user-space clients/mounters (fuse, rbd-nbd) is that mounts on workload-attached shares/volumes become unavailable or disturbed when the CSI pods are respawned. This is the main technical limitation we have at the moment.

Reference issue: https://github.com/ceph/ceph-csi/issues/792

Here are the items needed to move this issue to the fixed state:

Ceph CSI v3.4.0 Progress: [Alpha]

v3.4.0 Beyond:

humblec commented 5 years ago

#557

dillaman commented 5 years ago

Perhaps the biggest current limitation of NBD is that the rbd-nbd daemon is run within the ceph-csi container. Upgrading the container is not possible w/o killing running workloads.

There are plans to support multiple NBD block devices under a single rbd-nbd process. Additionally, multiple rbd-nbd daemons could be attached to the same block device to provide round-robin IO support (multipath). If rbd-nbd was moved to its own container, was run in a set of at least 2, and an upgrade would only bring down one rbd-nbd daemon concurrently, we could address those concerns.

However, this now implies that we need a way to send messages from the ceph-csi node attacher to multiple rbd-nbd containers on the same node. We also need a way for a restarted rbd-nbd daemon to re-discover which block devices are tied to which RBD images (and associated credentials). One possibility is to expose this via the existing admin socket capabilities of the daemon and allow the daemons to query other rbd-nbd daemons to rebuild their state upon startup.

CC: @mikechristie
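For illustration, here is a minimal Go sketch of the state-rediscovery idea above: a freshly restarted daemon querying a peer daemon's admin socket to learn which nbd devices map to which RBD images. The socket path, the JSON wire format, and the `list-mappings` command are all hypothetical and are not the real rbd-nbd admin-socket protocol.

```go
// Hedged sketch only: a restarted daemon rebuilding its mapping state by
// asking a peer daemon over a (hypothetical) admin socket.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// Mapping describes one nbd device <-> RBD image association
// (hypothetical shape; a real daemon would also need credentials).
type Mapping struct {
	Device string `json:"device"` // e.g. /dev/nbd0
	Pool   string `json:"pool"`
	Image  string `json:"image"`
}

// queryPeerMappings asks a peer daemon which devices it currently serves,
// using an invented JSON-line protocol over a unix socket.
func queryPeerMappings(socketPath string) ([]Mapping, error) {
	conn, err := net.DialTimeout("unix", socketPath, 2*time.Second)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", socketPath, err)
	}
	defer conn.Close()

	// Hypothetical request: a single JSON line naming the command.
	if err := json.NewEncoder(conn).Encode(map[string]string{"command": "list-mappings"}); err != nil {
		return nil, err
	}

	// Hypothetical response: a single JSON document with the mapping list.
	var mappings []Mapping
	if err := json.NewDecoder(bufio.NewReader(conn)).Decode(&mappings); err != nil {
		return nil, err
	}
	return mappings, nil
}

func main() {
	// A restarted daemon could walk its peers' sockets and merge the results
	// to rediscover which nbd devices belong to which images.
	mappings, err := queryPeerMappings("/run/rbd-nbd/peer-0.asok") // hypothetical path
	if err != nil {
		fmt.Println("peer query failed:", err)
		return
	}
	for _, m := range mappings {
		fmt.Printf("%s -> %s/%s\n", m.Device, m.Pool, m.Image)
	}
}
```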

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

mykaul commented 4 years ago

> Perhaps the biggest current limitation of NBD is that the rbd-nbd daemon is run within the ceph-csi container. Upgrading the container is not possible w/o killing running workloads.
>
> There are plans to support multiple NBD block devices under a single rbd-nbd process. Additionally, multiple rbd-nbd daemons could be attached to the same block device to provide round-robin IO support (multipath). If rbd-nbd was moved to its own container, was run in a set of at least 2, and an upgrade would only bring down one rbd-nbd daemon concurrently, we could address those concerns.
>
> However, this now implies that we need a way to send messages from the ceph-csi node attacher to multiple rbd-nbd containers on the same node. We also need a way for a restarted rbd-nbd daemon to re-discover which block devices are tied to which RBD images (and associated credentials). One possibility is to expose this via the existing admin socket capabilities of the daemon and allow the daemons to query other rbd-nbd daemons to rebuild their state upon startup.
>
> CC: @mikechristie

We can't upgrade the RBD client without killing the container either, as long as we are using kRBD, but I assume the concern is the reverse: our inability to upgrade the ceph-csi container? I'm unsure; I believe we can upgrade it, but existing users won't switch to the new version until they restart?

In any case, I believe we should unstale this and think about it some more.

dillaman commented 4 years ago

> We can't upgrade the RBD client without killing the container either, as long as we are using kRBD, but I assume the concern is the reverse: our inability to upgrade the ceph-csi container? I'm unsure; I believe we can upgrade it, but existing users won't switch to the new version until they restart?
>
> In any case, I believe we should unstale this and think about it some more.

Restarting the ceph-csi pod when using krbd won't result in IO failures of other pods using RBD PVs. We do have a solution to prevent IO failures if the rbd-nbd daemon is stopped/killed and the ability to re-attach a new rbd-nbd daemon to an orphaned nbd device. Next step is to persistently record (on tmpfs) live rbd-nbd mappings so a restart of the daemon can automatically recover.

The one missing piece is that a restarted rbd-nbd will need the cluster credentials to reconnect, which ceph-csi doesn't currently persist; it only receives them via secrets on node CSI actions. It's not unsolvable, we just need to agree on the division of responsibility.
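As a rough illustration of the "persistently record live rbd-nbd mappings on tmpfs" idea, here is a hedged Go sketch. The state directory, file layout, and struct fields are assumptions, not the actual ceph-csi implementation; note that credentials are deliberately not stored, which is exactly the gap described above.

```go
// Hedged sketch: persist nbd-device <-> RBD-image mappings on a tmpfs-backed
// directory at map time, and reload them after a daemon/plugin restart.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// NBDMapping records which nbd device backs which RBD image for one volume.
// Credentials are intentionally absent; they arrive via CSI secrets per call.
type NBDMapping struct {
	VolumeID string `json:"volumeID"`
	Pool     string `json:"pool"`
	Image    string `json:"image"`
	Device   string `json:"device"` // e.g. /dev/nbd1
}

// stateDir would sit on a tmpfs mount shared with the host (hypothetical
// path), so it survives a plugin-container restart but not a node reboot.
const stateDir = "/run/ceph-csi/rbd-nbd"

// saveMapping persists one mapping as <volumeID>.json at map time.
func saveMapping(m NBDMapping) error {
	if err := os.MkdirAll(stateDir, 0o700); err != nil {
		return err
	}
	data, err := json.Marshal(m)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(stateDir, m.VolumeID+".json"), data, 0o600)
}

// loadMappings is what a restarted daemon/plugin would call to learn which
// devices it was serving before it went down.
func loadMappings() ([]NBDMapping, error) {
	entries, err := os.ReadDir(stateDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // nothing was mapped before the restart
		}
		return nil, err
	}
	var out []NBDMapping
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(stateDir, e.Name()))
		if err != nil {
			return nil, err
		}
		var m NBDMapping
		if err := json.Unmarshal(data, &m); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, nil
}

func main() {
	// Record a mapping at attach time, then recover it after a "restart".
	_ = saveMapping(NBDMapping{VolumeID: "vol-1", Pool: "rbd", Image: "img-1", Device: "/dev/nbd1"})
	if ms, err := loadMappings(); err == nil {
		for _, m := range ms {
			fmt.Printf("recovered %s -> %s/%s\n", m.Device, m.Pool, m.Image)
		}
	}
}
```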

pkalever commented 3 years ago

Here are the items needed to move this issue to the fixed state:

humblec commented 3 years ago

I think this issue can remain open until we have at least a beta version of the NBD support. I am not sure about the exact reason for reopening this, though.

pkalever commented 3 years ago

> I think this issue can remain open until we have at least a beta version of the NBD support. I am not sure about the exact reason for reopening this, though.

We have tied this issue to two tasks: 1. the volume healer, 2. the design doc.

The second one, https://github.com/ceph/ceph-csi/pull/2275, is still open.

humblec commented 3 years ago

> I think this issue can remain open until we have at least a beta version of the NBD support. I am not sure about the exact reason for reopening this, though.
>
> We have tied this issue to two tasks: 1. the volume healer, 2. the design doc.
>
> The second one, #2275, is still open.

@pkalever one idea would be to have a sub-issue for alpha support that lists the items, and likewise list the sub-tasks or pending items for beta and above, so proper tracking can be done at each stage.

pkalever commented 3 years ago

> @pkalever one idea would be to have a sub-issue for alpha support that lists the items, and likewise list the sub-tasks or pending items for beta and above, so proper tracking can be done at each stage.

Sure, I can open a new issue to track the list of tasks needed to move the rbd-nbd mounter to beta support, if that's what you meant.

humblec commented 3 years ago

> @pkalever one idea would be to have a sub-issue for alpha support that lists the items, and likewise list the sub-tasks or pending items for beta and above, so proper tracking can be done at each stage.
>
> Sure, I can open a new issue to track the list of tasks needed to move the rbd-nbd mounter to beta support, if that's what you meant.

@pkalever this issue needs to be kept open until we are out of the alpha state, as it asks for full support. So if you can create tracker issues for the concerns raised in https://github.com/ceph/ceph-csi/pull/2275#issuecomment-887508997 and add them as sub-tasks, we can use this issue for tracking until we have a fully supported version.

pkalever commented 3 years ago

> @pkalever this issue needs to be kept open until we are out of the alpha state, as it asks for full support. So if you can create tracker issues for the concerns raised in #2275 (comment) and add them as sub-tasks, we can use this issue for tracking until we have a fully supported version.

I would rather track it with a fresh issue to make others' lives easier.

I have opened a new tracker for the beta status and linked the issues there; PTAL: https://github.com/ceph/ceph-csi/issues/2323

If you still think that is not enough, I'm fine with whatever works for others.

humblec commented 3 years ago

> @pkalever this issue needs to be kept open until we are out of the alpha state, as it asks for full support. So if you can create tracker issues for the concerns raised in #2275 (comment) and add them as sub-tasks, we can use this issue for tracking until we have a fully supported version.
>
> I would rather track it with a fresh issue to make others' lives easier.
>
> I have opened a new tracker for the beta status and linked the issues there; PTAL: #2323
>
> If you still think that is not enough, I'm fine with whatever works for others.

Let's keep this open and have the trackers point here, as this issue already exists and is referenced from other issues raised in this area. I have also updated the description with the progress, so that it gives a summary.

pkalever commented 2 years ago

@humblec how can we move forward with this? do we have a plan?

cc: @Madhu-1

mykaul commented 2 years ago

@pkalever - do we have some performance comparison numbers between the two implementations?

pkalever commented 2 years ago

> @pkalever - do we have some performance comparison numbers between the two implementations?

@mykaul sorry nope! I will plan to work on this and make sure to link here.

Madhu-1 commented 2 years ago
> * Attacher dependency
>   I guess we discussed this before and this cannot be avoided at the moment.

Are we still continuing with removing the attacher?

> * CSI leak
>   As @nixpanic already mentioned, this is not just with the healer, but also with many other things.

Moving the Kubernetes logic to a new sidecar will hide the implementation from cephcsi.

cc: @Madhu-1

humblec commented 2 years ago

> @humblec how can we move forward with this? do we have a plan?
>
> * Attacher dependency
>   I guess we discussed this before and this cannot be avoided at the moment.
>
> * CSI leak
>   As @nixpanic already mentioned, this is not just with the healer, but also with many other things.

@pkalever sure, let me update this issue with the process I am thinking of.

humblec commented 2 years ago

@pkalever one of the solutions we can think of here is writing the NBD share information to a metafile on the host at attach/mount time and then consuming it when the driver plugin restarts. This way, we don't have to depend on the VA objects or other hacks. Since it is on the host FS, it will persist across driver restarts. This is the same thought Jan shared in his comment, too. Can we do a similar mechanism?

pkalever commented 2 years ago

> @pkalever one of the solutions we can think of here is writing the NBD share information to a metafile on the host at attach/mount time and then consuming it when the driver plugin restarts. This way, we don't have to depend on the VA objects or other hacks. Since it is on the host FS, it will persist across driver restarts. This is the same thought Jan shared in his comment, too. Can we do a similar mechanism?

We already discussed this before building the volume healer. We get the PV objects from the VA list and then extract the secrets from the PV objects. Since we cannot store the secrets in the metafile, this is not a good idea.

humblec commented 2 years ago

> @pkalever one of the solutions we can think of here is writing the NBD share information to a metafile on the host at attach/mount time and then consuming it when the driver plugin restarts. This way, we don't have to depend on the VA objects or other hacks. Since it is on the host FS, it will persist across driver restarts. This is the same thought Jan shared in his comment, too. Can we do a similar mechanism?
>
> We already discussed this before building the volume healer. We get the PV objects from the VA list and then extract the secrets from the PV objects. Since we cannot store the secrets in the metafile, this is not a good idea.

Not sure what I am missing here. The trigger to know whether there were existing NBD connections is the presence of the metadata files: if a metafile is present, fetch the NBD device details from it. The secrets are already part of requests like NodeStage, in a separate field of the request; in other words, we don't want to fetch them from the PV at all. Can that be done? If there are glitches with this model and they have already been discussed or documented, I am happy to check on that if you can provide a reference. The only thing I am proposing here is the possibility of removing the dependency on the VA, or avoiding the CSI leak that was raised by the storage team.
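To make the point about secrets concrete: in the CSI spec's Go bindings, the credentials arrive as a dedicated field of the NodeStage request. Below is a hedged sketch; the request type and its accessors come from the CSI spec, but the handler body is purely illustrative and is not ceph-csi's actual NodeStageVolume.

```go
// Hedged sketch of where per-call credentials show up in a CSI node plugin.
package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a stand-in for a CSI node plugin; it is not ceph-csi's
// real implementation.
type nodeServer struct{}

func (ns *nodeServer) NodeStageVolume(
	ctx context.Context,
	req *csi.NodeStageVolumeRequest,
) (*csi.NodeStageVolumeResponse, error) {
	// The secrets are a dedicated field of the request, delivered fresh on
	// every call (e.g. by the kubelet); the plugin does not persist them.
	secrets := req.GetSecrets() // map[string]string

	// Volume identity and the staging path come in other fields of the
	// same request.
	volID := req.GetVolumeId()
	staging := req.GetStagingTargetPath()

	// A plugin could map the rbd-nbd device here and write volID/device
	// details to a host metafile, but without also storing the secrets a
	// restarted daemon still cannot reconnect by itself, which is why the
	// healer re-drives NodeStage with freshly fetched secrets.
	fmt.Printf("staging %s at %s with %d secret keys\n", volID, staging, len(secrets))
	return &csi.NodeStageVolumeResponse{}, nil
}

func main() {
	// gRPC server registration is omitted; this only shows the request shape.
	_ = &nodeServer{}
}
```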

pkalever commented 2 years ago

> @pkalever one of the solutions we can think of here is writing the NBD share information to a metafile on the host at attach/mount time and then consuming it when the driver plugin restarts. This way, we don't have to depend on the VA objects or other hacks. Since it is on the host FS, it will persist across driver restarts. This is the same thought Jan shared in his comment, too. Can we do a similar mechanism?
>
> We already discussed this before building the volume healer. We get the PV objects from the VA list and then extract the secrets from the PV objects. Since we cannot store the secrets in the metafile, this is not a good idea.
>
> Not sure what I am missing here. The trigger to know whether there were existing NBD connections is the presence of the metadata files: if a metafile is present, fetch the NBD device details from it. The secrets are already part of requests like NodeStage, in a separate field of the request; in other words, we don't want to fetch them from the PV at all. Can that be done? If there are glitches with this model and they have already been discussed or documented, I am happy to check on that if you can provide a reference. The only thing I am proposing here is the possibility of removing the dependency on the VA, or avoiding the CSI leak that was raised by the storage team.

Who sends the NodeStage request? If you missed it, it is the healer!

Where does the healer get the secrets from? From the VA and PV object details.

Now, what do you want to store in the metafile and retrieve later?
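For readers following along, here is a hedged client-go sketch of the lookup path described above (VolumeAttachments on the node, the referenced PV, its NodeStageSecretRef, and finally the secret contents). It is illustrative only, not the actual volume-healer code; the node name is a placeholder and error handling is minimal.

```go
// Hedged sketch: resolve secrets the way the discussion describes,
// VA -> PV -> NodeStageSecretRef -> Secret.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	ctx := context.Background()

	// In-cluster config, as a node-plugin pod would use.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const driverName = "rbd.csi.ceph.com" // default ceph-csi RBD driver name
	const nodeName = "worker-0"           // placeholder; normally from the downward API

	vas, err := client.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, va := range vas.Items {
		// Only attachments for this driver on this node are interesting.
		if va.Spec.Attacher != driverName || va.Spec.NodeName != nodeName {
			continue
		}
		pvName := va.Spec.Source.PersistentVolumeName
		if pvName == nil {
			continue
		}

		pv, err := client.CoreV1().PersistentVolumes().Get(ctx, *pvName, metav1.GetOptions{})
		if err != nil || pv.Spec.CSI == nil {
			continue
		}

		// This is where the credentials come from: the PV's stage secret ref,
		// not from anything the plugin stored on the node.
		ref := pv.Spec.CSI.NodeStageSecretRef
		if ref == nil {
			continue
		}
		secret, err := client.CoreV1().Secrets(ref.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			continue
		}
		fmt.Printf("PV %s: %d secret keys available for re-staging\n", pv.Name, len(secret.Data))
	}
}
```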

pkalever commented 2 years ago

Here is the basic design doc: https://github.com/ceph/ceph-csi/blob/devel/docs/design/proposals/rbd-volume-healer.md#volume-healer