humblec opened this issue 5 years ago
Perhaps the biggest current limitation of NBD is that the rbd-nbd daemon runs inside the ceph-csi container, so upgrading the container is not possible without killing running workloads.
There are plans to support multiple NBD block devices under a single rbd-nbd process. Additionally, multiple rbd-nbd daemons could be attached to the same block device to provide round-robin IO support (multipath). If rbd-nbd were moved to its own container, run as a set of at least two, and an upgrade only brought down one rbd-nbd daemon at a time, we could address those concerns.
However, this now implies that we need a way to send messages from the ceph-csi node attacher to multiple rbd-nbd containers on the same node. We also need a way for a restarted rbd-nbd daemon to re-discover which block devices are tied to which RBD images (and associated credentials). One possibility is to expose this via the existing admin socket capabilities of the daemon and allow the daemons to query other rbd-nbd daemons to rebuild their state upon startup.
CC: @mikechristie
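To make the admin-socket idea concrete, here is a minimal Go sketch of a restarted daemon (or a helper in its container) querying a peer's socket to rebuild its image-to-device state. The socket path, command name, and reply schema are hypothetical (nothing like this exists in rbd-nbd or ceph-csi today); the framing assumes the classic Ceph admin-socket convention of a NUL-terminated JSON command answered by a 4-byte big-endian length plus JSON payload.

```go
// Hypothetical sketch: query a peer rbd-nbd daemon's admin socket so a
// restarted daemon can rebuild its RBD-image <-> /dev/nbdX state.
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"time"
)

func queryAdminSocket(path, cmdJSON string) ([]byte, error) {
	conn, err := net.DialTimeout("unix", path, 5*time.Second)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	// Command is a JSON object terminated by a NUL byte.
	if _, err := conn.Write(append([]byte(cmdJSON), 0)); err != nil {
		return nil, err
	}

	// Reply: 4-byte big-endian length, then the JSON payload.
	var size uint32
	if err := binary.Read(conn, binary.BigEndian, &size); err != nil {
		return nil, err
	}
	buf := make([]byte, size)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	// Placeholder socket path and command; a real implementation would walk
	// the sockets of every rbd-nbd container on the node.
	out, err := queryAdminSocket("/run/ceph/rbd-nbd.0.asok",
		`{"prefix": "list mappings"}`) // hypothetical command
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("mappings: %s\n", out)
}
```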
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
We can't upgrade the RBD client either without killing the container as long as we are using kRBD, but I assume the concern is the reverse - about our inability to upgrade the ceph-csi container? I'm unsure - I believe we can upgrade it, but not change existing users to use it until they restart?
In any case, I believe we should unstale this and think about it some more.
Restarting the ceph-csi pod when using krbd won't result in IO failures for other pods using RBD PVs. We do have a solution to prevent IO failures if the rbd-nbd daemon is stopped/killed, along with the ability to re-attach a new rbd-nbd daemon to an orphaned nbd device. The next step is to persistently record (on tmpfs) the live rbd-nbd mappings so a restart of the daemon can automatically recover them.
The one missing piece is that a restarted rbd-nbd will need the cluster credentials to reconnect, which ceph-csi doesn't currently save and instead receives via secrets on node CSI actions. It's not unsolvable; we just need to agree on the division of responsibility.
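As an illustration of the "persistently record live mappings" step, here is a minimal Go sketch, assuming a tmpfs-backed state directory: one small JSON record per mapping, written at map time and read back after a daemon restart so the orphaned /dev/nbdX devices can be re-attached. The directory, file layout, and field names are illustrative only, and credentials are deliberately not persisted, which is exactly the missing piece described above.

```go
// Sketch of recording live rbd-nbd mappings on tmpfs for recovery.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Mapping holds only the non-sensitive state needed to re-attach.
type Mapping struct {
	Pool   string `json:"pool"`
	Image  string `json:"image"`
	Device string `json:"device"` // e.g. /dev/nbd0
}

const stateDir = "/run/rbd-nbd-state" // assumed tmpfs-backed location

func saveMapping(m Mapping) error {
	if err := os.MkdirAll(stateDir, 0o700); err != nil {
		return err
	}
	data, err := json.Marshal(m)
	if err != nil {
		return err
	}
	// One file per nbd device keeps add/remove simple.
	name := filepath.Base(m.Device) + ".json"
	return os.WriteFile(filepath.Join(stateDir, name), data, 0o600)
}

func loadMappings() ([]Mapping, error) {
	entries, err := os.ReadDir(stateDir)
	if os.IsNotExist(err) {
		return nil, nil // nothing to recover
	} else if err != nil {
		return nil, err
	}
	var out []Mapping
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(stateDir, e.Name()))
		if err != nil {
			return nil, err
		}
		var m Mapping
		if err := json.Unmarshal(data, &m); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, nil
}

func main() {
	// Illustrative only: record one mapping, then read it back as a
	// restarted daemon would.
	_ = saveMapping(Mapping{Pool: "rbd", Image: "csi-vol-0001", Device: "/dev/nbd0"})
	ms, _ := loadMappings()
	fmt.Println("recovered mappings:", ms)
}
```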
I think this issue can remain open till we have at least a beta version of NBD support. I am not sure about the exact reason for reopening it, though.
We have tied this issue to two tasks: 1. volume healer, 2. design doc.
The second one, https://github.com/ceph/ceph-csi/pull/2275, is still open.
@pkalever one idea would be to have a sub-issue for alpha support and list the items there, i.e. list the sub-tasks or pending items for beta or higher support, so proper tracking can be done at each stage.
Sure, I can open a new issue to track the list of tasks needed to move the rbd-nbd mounter to beta support, if that's what you meant.
@pkalever this issue needs to be kept open till we are out of the alpha state, as it asks for full support. If you can create tracker issues for the concerns raised in https://github.com/ceph/ceph-csi/pull/2275#issuecomment-887508997 and add them as sub-tasks, we can use this issue for tracking until we have a fully supported version.
I would rather track it with a fresh issue to make everyone's life easier.
I have opened a new tracker for the beta status and linked the issues there, PTAL: https://github.com/ceph/ceph-csi/issues/2323
If you still think that is not enough, I'm fine with whatever works for others.
Let's keep this open and have the trackers point here, as this issue exists and is referenced from other issues raised in this area. I have also updated the description with the progress, so that helps to get a summary.
@humblec how can we move forward with this? Do we have a plan?

- Attacher dependency: I guess we discussed this before and it cannot be avoided at the moment.
- CSI leak: as @nixpanic already mentioned, this is not just about the healer but also many other things.

cc: @Madhu-1
@pkalever - do we have some performance comparison numbers between the two implementations?
@mykaul sorry nope! I will plan to work on this and make sure to link here.
> Attacher dependency: I guess we discussed this before and it cannot be avoided at the moment.

Are we still continuing with removing the attacher?

> CSI leak: as @nixpanic already mentioned, this is not just about the healer but also many other things.

Moving the Kubernetes logic to a new sidecar will hide the implementation from cephcsi.
@pkalever sure, let me follow up with the process I am thinking of.
@pkalever one of the solutions we can think of here is recording the NBD share information in a metafile on the host at attach/mount time and then consuming it when the driver plugin restarts. This way, we don't have to depend on the VA objects or other hacks. Being on the host FS, it will be persistent across driver restarts. This is the same thought shared by Jan in the comment too. Can we do a similar mechanism?
We already discussed this before building the volume healer. We get the PV objects from the VA list and then extract the secrets from the PV object. Since we cannot store the secrets in the metafile, this is not a good idea.
Not sure what I am missing here. The trigger to check or know that there were existing NBD connections is the presence of the metadata files; if a metafile is present, fetch the NBD device details from it. The secrets are already part of requests like NodeStage, in a separate field of the request altogether; in other words, we don't want to fetch them from the PV at all. Can that be done? If there are glitches with this model and it has already been discussed or documented, I am happy to check on it if you can provide some reference. The only thing I am proposing here is the possibility of removing the dependency on the VA, or avoiding the CSI leak that was raised by the storage team.
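To illustrate the point about secrets arriving with the request rather than being fetched from the PV, here is a skeleton Go NodeStageVolume handler using the CSI spec types. This is not ceph-csi's actual node server code, and the volume-context keys are placeholders.

```go
// A skeleton only: in a NodeStageVolume call the credentials arrive in the
// request's own Secrets field, separate from the volume details, so an
// on-host metafile would only ever need the non-sensitive pieces.
package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct{}

func (ns *nodeServer) NodeStageVolume(
	ctx context.Context, req *csi.NodeStageVolumeRequest,
) (*csi.NodeStageVolumeResponse, error) {
	// Credentials come with the request; they are not read from the PV
	// object here and would not be written to any metafile.
	creds := req.GetSecrets()

	// Non-sensitive details that a metafile could legitimately record.
	volID := req.GetVolumeId()
	pool := req.GetVolumeContext()["pool"]       // placeholder key
	image := req.GetVolumeContext()["imageName"] // placeholder key

	fmt.Printf("staging %s (%s/%s), %d secret keys supplied\n",
		volID, pool, image, len(creds))
	return &csi.NodeStageVolumeResponse{}, nil
}

func main() {}
```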
Who sends the NodeStage request? If you missed it, it is the healer!
Where does the healer get the secrets from? From the VA and PV object details.
Now, what do you want to store in the metafile and retrieve later?
Here is the basic design doc: https://github.com/ceph/ceph-csi/blob/devel/docs/design/proposals/rbd-volume-healer.md#volume-healer
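For reference, a condensed Go sketch of the flow described above: list the VolumeAttachments for this node, resolve each PV, and locate the node-stage secret reference that lets a NodeStageVolume request be rebuilt. This is a sketch of the described flow, not the healer's actual code; the node name is a placeholder and error handling is minimal.

```go
// Sketch: VolumeAttachment -> PV -> node-stage secret reference.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const nodeName = "worker-0"           // placeholder
	const driverName = "rbd.csi.ceph.com" // RBD CSI driver name

	vaList, err := client.StorageV1().VolumeAttachments().List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, va := range vaList.Items {
		if va.Spec.NodeName != nodeName || va.Spec.Attacher != driverName {
			continue
		}
		pvName := va.Spec.Source.PersistentVolumeName
		if pvName == nil {
			continue
		}
		pv, err := client.CoreV1().PersistentVolumes().Get(
			context.TODO(), *pvName, metav1.GetOptions{})
		if err != nil || pv.Spec.CSI == nil || pv.Spec.CSI.NodeStageSecretRef == nil {
			continue
		}
		// The secret reference gives the healer the credentials it needs to
		// re-issue NodeStageVolume for this volume after a plugin restart.
		ref := pv.Spec.CSI.NodeStageSecretRef
		fmt.Printf("pv=%s volumeHandle=%s secret=%s/%s\n",
			pv.Name, pv.Spec.CSI.VolumeHandle, ref.Namespace, ref.Name)
	}
}
```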
Even though we have support for rbd-nbd in CSI, it is not the default and is not well qualified. This is a feature request to bring more qualification to rbd-nbd and make it fully supported with RBD CSI. The main issue we have come across with userspace-based clients/mounters (fuse, rbd-nbd) is the unavailability of, or disruption to, mounts on workload-attached shares/volumes when the CSI pods are respawned. This is the main technical limitation we have at the moment.
Reference issue: https://github.com/ceph/ceph-csi/issues/792
Here are the items needed to move this issue to fixed:

Ceph CSI v3.4.0 progress: [Alpha]

v3.4.0 and beyond:
- CSI leak
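For context on what "fully supported" would build on: today a user opts into the rbd-nbd mounter per StorageClass. Below is a minimal sketch using the Kubernetes Go types; the parameter keys other than "mounter" follow the usual ceph-csi StorageClass example, and all values are placeholders.

```go
// Sketch: a StorageClass selecting the userspace rbd-nbd mounter.
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	sc := storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "csi-rbd-nbd"},
		Provisioner: "rbd.csi.ceph.com",
		Parameters: map[string]string{
			"clusterID": "<cluster-id>",
			"pool":      "replicapool",
			// Use the userspace rbd-nbd mounter instead of krbd.
			"mounter": "rbd-nbd",
			"csi.storage.k8s.io/provisioner-secret-name":      "csi-rbd-secret",
			"csi.storage.k8s.io/provisioner-secret-namespace": "ceph-csi",
			"csi.storage.k8s.io/node-stage-secret-name":       "csi-rbd-secret",
			"csi.storage.k8s.io/node-stage-secret-namespace":  "ceph-csi",
		},
	}
	fmt.Printf("StorageClass %s uses mounter=%s\n", sc.Name, sc.Parameters["mounter"])
}
```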