ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Add timeout to Ceph GET API calls #3657

Open Madhu-1 opened 1 year ago

Madhu-1 commented 1 year ago

Describe the feature you'd like to have

Provide a way to configure a timeout for the Ceph GET API calls, so that commands do not get stuck when there is a problem between the Ceph cluster and the CSI driver (cluster health issues, slow ops, or a short network connectivity problem).

What is the value to the end user? (why is it a priority?)

Currently, if Ceph doesn't respond to a CSI call, cephcsi starts throwing an "operation already exists" error message even after the Ceph cluster has recovered; the only way to recover the CSI driver is to restart the CSI pods. Restarting the CSI driver pods is not an acceptable solution in most production clusters.

How would the end user gain value from having this feature?

Avoid restarting CSI pods in production clusters even when a GET API call is stuck. The ask is to add a timeout only to GET API calls, not to any other operations, to avoid leaving stale resources in the cluster.

But again, we need to consider all the corner cases carefully before making this change.
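A minimal sketch of the kind of timeout being asked for, using a hypothetical helper rather than existing cephcsi code: the blocking Ceph call runs in a goroutine and the caller stops waiting once the context deadline expires. Note that the underlying librados call is not cancelled; only the CSI side stops waiting.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs fn in a goroutine and waits for either its result
// or the context deadline, whichever comes first.
func callWithTimeout(ctx context.Context, fn func() error) error {
	done := make(chan error, 1) // buffered: the goroutine can still send after a timeout
	go func() { done <- fn() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The underlying Ceph call keeps running; we only stop waiting
		// so the CSI operation can fail fast instead of hanging forever.
		return fmt.Errorf("ceph GET call timed out: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	err := callWithTimeout(ctx, func() error {
		time.Sleep(time.Second) // stand-in for a blocking go-ceph GET call
		return nil
	})
	if err != nil && errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out:", err)
		return
	}
	fmt.Println("result:", err)
}
```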

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

Rakshith-R commented 1 year ago

/assign @karthik-us

nixpanic commented 1 year ago

Maybe context.WithTimeout() can be used in the ConnPool.Get() function. Get() could return an additional context, which would be cancelled after the timeout. ConnPool.Put() should then mark the context as successfully completed.

If the context times out, the ConnPool logic should remove the connection from the pool and disconnect it (which hopefully works, but may block indefinitely?).
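A rough sketch of that idea, using hypothetical types and method names (the real ConnPool in ceph-csi's internal/util package differs in detail): Get() hands back a context alongside the pooled connection, Put() cancels it to mark the operation as completed, and a watchdog goroutine evicts the connection if the deadline expires first.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ConnPool is a stand-in for ceph-csi's connection pool; the real one
// stores *rados.Conn values keyed by monitors and credentials.
type ConnPool struct{}

type pooledConn struct {
	cancel context.CancelFunc
	// conn *rados.Conn would live here in the real pool
}

// evict would remove the connection from the pool and shut it down;
// as noted above, the disconnect itself may block.
func (cp *ConnPool) evict(pc *pooledConn) {
	fmt.Println("evicting timed-out connection from the pool")
}

// GetWithTimeout returns a connection handle plus a context that expires
// after timeout unless PutWithTimeout() is called first.
func (cp *ConnPool) GetWithTimeout(timeout time.Duration) (*pooledConn, context.Context) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	pc := &pooledConn{cancel: cancel}

	go func() {
		<-ctx.Done()
		if ctx.Err() == context.DeadlineExceeded {
			// Put() never arrived: assume the call is stuck and evict.
			cp.evict(pc)
		}
	}()
	return pc, ctx
}

// PutWithTimeout marks the operation as successfully completed, turning
// the watchdog goroutine into a no-op.
func (cp *ConnPool) PutWithTimeout(pc *pooledConn) {
	pc.cancel()
}

func main() {
	cp := &ConnPool{}
	pc, _ := cp.GetWithTimeout(50 * time.Millisecond)
	_ = pc                             // PutWithTimeout(pc) is never called here
	time.Sleep(150 * time.Millisecond) // give the watchdog time to fire
}
```

The open question from the comment above still applies: evicting the connection requires a disconnect, which may itself block.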

karthik-us commented 1 year ago

Thanks @nixpanic for your input on this. I will do some more research on this, as well as on John's comment in the referenced issues where he suggested exploring the timeout parameters on rados connections. Let me figure out the best possible approach to fix this issue.
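For reference, the rados-level timeout parameters mentioned above could be set through go-ceph's SetConfigOption(). The option names used here (rados_osd_op_timeout, rados_mon_op_timeout) and their exact behaviour are assumptions that should be verified against the Ceph/librados version in use before relying on them:

```go
package main

import (
	"log"

	"github.com/ceph/go-ceph/rados"
)

// connectWithTimeouts opens a rados connection with client-side operation
// timeouts so that stuck OSD/MON requests fail instead of blocking forever.
func connectWithTimeouts() (*rados.Conn, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	// Assumed option names; values are seconds, and 0 means "no timeout" in Ceph.
	if err := conn.SetConfigOption("rados_osd_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("rados_mon_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	return conn, nil
}

func main() {
	conn, err := connectWithTimeouts()
	if err != nil {
		log.Fatalf("failed to connect to ceph: %v", err)
	}
	defer conn.Shutdown()
	log.Println("connected with rados op timeouts set")
}
```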

ADustyOldMuffin commented 9 months ago

Howdy, any progress on this? We experience this all the time with calls to mount PVCs just hanging forever. The CSI driver doesn't log anything or provide any metrics, so we have no way to troubleshoot this.

Madhu-1 commented 8 months ago

@ADustyOldMuffin you might see an "operation already exists" error/warning message. If you see that, it may be time to fix things on the Ceph cluster and restart the CSI pods (for now, as a workaround).

ADustyOldMuffin commented 8 months ago

@Madhu-1 I get that it indicates something is wrong elsewhere, but that's not a valid answer to the CSI pods getting stuck forever. The CSI driver should handle connections to Ceph gracefully and output meaningful error messages, not generic ones. I would expect anything attempting to connect to an outside service to manage its connections properly.

yxxhero commented 2 months ago

Any update? I'm hitting the same issue. Thanks so much.

Madhu-1 commented 2 months ago

> @Madhu-1 I get that it indicates something is wrong elsewhere, but that's not a valid answer to the CSI pods getting stuck forever. The CSI driver should handle connections to Ceph gracefully and output meaningful error messages, not generic ones. I would expect anything attempting to connect to an outside service to manage its connections properly.

@ADustyOldMuffin we use the go-ceph API, which internally calls C functions that don't have timeout options. If a command gets stuck, cephcsi cannot know the reason for it; that is why we have a generic error message. If we could know the exact reason from Ceph, we could handle it, but currently that's not possible.

Madhu-1 commented 2 months ago

> Any update? I'm hitting the same issue. Thanks so much.

@yxxhero not yet, we don't have an owner for this one, but we will revisit it in the next release (if possible). Contributions from the community are always welcome :)

mgfnv9 commented 4 days ago

I'm hitting the same issue; any update on this question @Madhu-1?