Open RuairiSinkler opened 1 year ago
Hello @RuairiSinkler
Did you happen to run into this issue again at all?
Thank you for the detailed description. It does look like Trident expected a response from FSxN, but it never arrived on time. This makes me feel like it could be an isolated instance, but I thought I'd ask you anyways...
Hi @balaramesh - yes, we hit this issue repeatedly, around 50% of the time I would say, and were unable to find a solution. It (amongst other issues) has caused us to abandon Trident as our NetApp CSI solution.
Describe the bug
This issue is intermittent, though it happens more often than not.
When trying to provision an AmazonFSx Flexgroup volume as a clone from a snapshot, Trident is reporting failure (and eternally failing) despite successfully creating the volume in AmazonFSx.
See (anonymised and shortened) Trident-main logs here:
What appears to be happening is that Trident is requesting the clone be created, timing out while waiting for it to be finished, and then requesting it again. The first request then succeeds, and the second fails because the "volume already exists".
This can also be seen in the AmazonFSx job history:
Note how Job 1452 is interleaved with the beginning of job 1453 - both requested from Trident. 1452 then succeeds, and 1453 (the one Trident is now presumably tracking) fails due to a duplicate key.
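The sequence described above can be sketched as a small simulation. This is only an illustration of the suspected race, not Trident's code or the FSx/ONTAP API: `FakeBackend`, `provision_with_retry`, the discrete clock ticks, and the rule that a running job fails as soon as a volume of the same name exists are all assumptions made for the example.

```python
class FakeBackend:
    """Hypothetical storage backend that runs clone jobs asynchronously."""

    def __init__(self, job_ticks):
        self.job_ticks = job_ticks  # ticks a clone job needs to complete
        self.clock = 0
        self.volumes = set()
        self.jobs = []  # one dict per submitted job, in submission order

    def submit_clone(self, name):
        job = {
            "id": len(self.jobs) + 1,
            "name": name,
            "done_at": self.clock + self.job_ticks,
            "state": "running",
        }
        self.jobs.append(job)
        return job["id"]

    def tick(self):
        """Advance time by one tick and update every running job."""
        self.clock += 1
        for job in self.jobs:
            if job["state"] != "running":
                continue
            if job["name"] in self.volumes:
                # Assumed behaviour: a later job for the same name fails
                # once an earlier job has created the volume.
                job["state"] = "failed: duplicate"
            elif job["done_at"] <= self.clock:
                self.volumes.add(job["name"])
                job["state"] = "success"

    def job_state(self, job_id):
        return self.jobs[job_id - 1]["state"]


def provision_with_retry(backend, name, timeout_ticks, max_attempts=2):
    """Submit a clone, wait up to timeout_ticks, then resubmit on timeout.

    Only the most recently submitted job is tracked, which is the
    behaviour the issue suspects.
    """
    state = "running"
    for _ in range(max_attempts):
        job_id = backend.submit_clone(name)
        for _ in range(timeout_ticks):
            backend.tick()
            state = backend.job_state(job_id)
            if state != "running":
                return state
    return state


backend = FakeBackend(job_ticks=3)  # the job takes longer than the timeout
result = provision_with_retry(backend, "clone-1", timeout_ticks=2)

print(result)                               # failed: duplicate
print("clone-1" in backend.volumes)         # True
print([j["state"] for j in backend.jobs])   # ['success', 'failed: duplicate']
```

In this sketch the caller ends up with a failure for the job it is tracking even though the volume exists on the backend, which mirrors the state described for jobs 1452 and 1453.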
The way out of this situation appears to be manually deleting all resources on the cluster (editing out finalizers etc.) as well as the created resources in AmazonFSx.
Environment
Provide accurate information about the environment to help us reproduce the issue.
To Reproduce
Steps to reproduce the behaviour:
The following is completed automatically by a custom operator:
1. VolumeSnapshot with a manually imported volume PVC as the source
2. PVC with the snapshot as its source to trigger a clone
3. PVC remaining pending forever, and the above logs and state of the NetApp backend
Expected behaviour
A clear and concise description of what you expected to happen.
The clone should be created and bound to the PVC as normal.
Additional context
Add any other context about the problem here.
This problem does not occur every time, but it fails in this way most of the time. It was originally occurring on 23.01.0, and is still happening on the latest 23.10.0 after upgrading.