NetApp / trident

Storage orchestrator for containers
Apache License 2.0
732 stars 218 forks source link

Astra Trident "failed to create cloned volume" - but creates it in Amazon FSx #868

Open RuairiSinkler opened 7 months ago

RuairiSinkler commented 7 months ago

Describe the bug This issue is intermittent, though happens more-often-than-not.

When trying to provision an AmazonFSx Flexgroup volume as a clone from a snapshot, Trident is reporting failure (and eternally failing) despite successfully creating the volume in AmazonFSx.

See (anonymised and shortened) Trident-main logs here:

{"@timestamp":"2023-11-15T08:54:47Z","crdControllerEvent":"add","driver":"ontap-nas-flexgroup","error":"API status: failed, Reason: Insufficient privileges: user '<USER_NAME>' does not have write access to this resource, Code: 13003","level":"error","logLayer":"core","message":"Error logging EMS message.","requestID":"7ca9ab3a-545f-4909-8ff2-c1f16effe381","requestSource":"CRD","workflow":"cr=reconcile"}
...
{"@timestamp":"2023-11-15T08:55:13Z","level":"info","logLayer":"core","message":"Snapshot created.","requestID":"1e0d8a35-4d84-4c11-89c5-e7039464fa64","requestSource":"CSI","snapshotName":"snapshot-794022c3-b28b-4d06-9712-4d8cbe3dd391","volumeName":"<VOLUME_NAME>","workflow":"snapshot=create"}
...
{"@timestamp":"2023-11-15T08:56:20Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:56:20Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"c6f5256f-415b-445f-8c27-04b3945f87b3","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:32Z","level":"warning","logLayer":"core","message":"Job not completed after 120 seconds.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"waiting for async response failed","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:32Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: waiting for async response failed","requestID":"e6d3032f-ed33-4bd1-991c-975364fd6c20","requestSource":"CSI"}
...
{"@timestamp":"2023-11-15T08:57:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:57:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"abf74801-1725-46a2-9b11-19616b4159cb","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:18Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:18Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"92fc6cdc-e886-4df7-8566-c467d553df75","requestSource":"CSI"}
{"@timestamp":"2023-11-15T08:58:48Z","backend":"<BACKEND_NAME>","backendUUID":"a8ee139b-d7d9-49ad-a269-49ea1f7b207d","error":"volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","level":"warning","logLayer":"core","message":"Failed to create cloned volume on this backend.","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI","sourceVolume":"pvc-22f2b128-4d86-4d2e-8607-34fe465a8348","volume":"pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b","workflow":"volume=create"}
{"@timestamp":"2023-11-15T08:58:48Z","level":"error","logLayer":"csi_frontend","message":"GRPC error: rpc error: code = Unknown desc = failed to create cloned volume pvc-d342bb73-0e83-4530-9f80-8bfef58fed2b on backend <BACKEND_NAME>: volume <VOLUME_PREFIX>_pvc_d342bb73_0e83_4530_9f80_8bfef58fed2b already exists","requestID":"2d709a11-c424-4112-b9c1-8022f7f22ab3","requestSource":"CSI"}

What appears to be happening is that Trident is requesting the clone be created, timing out while waiting for it to be finished, and then requesting it again. The first request then succeeds, and the second fails because the "volume already exists".

This can also be seen in the AmazonFSx job history:

11/15 08:58:35 <FSX_ID> 
                              <SVM> 
                                         Clone Create         Failed      1453
     Description: Create
                  <VOLUME_PREFIX>_pvc_d342bb73_0e83_
                  4530_9f80_8bfef58fed2b
      Completion: Failed to create the FlexGroup clone volume                  
                  "<VOLUME_PREFIX>_pvc_d342bb73_0e83_                  
                  4530_9f80_8bfef58fed2b" in Vserver                           
                  "<SVM>". Reason: VLDB: volume                   
                  record not created, duplicate key. Cleanup                   
                  of the partially created FlexGroup clone                     
                  volumes failed. Contact technical support                    
                  for assistance.                                              
11/15 08:57:11 <FSX_ID>                                       
                              <SVM>                               
                                         Clone Create         Succeeded   1452 
     Description: Create                                                       
                  <VOLUME_PREFIX>_pvc_d342bb73_0e83_                   
                  4530_9f80_8bfef58fed2b                                       
      Completion: Successful                                                   
11/15 08:56:28 <FSX_ID>                                       
                              <SVM>                               
                                         Clone Create         Running     1453 
     Description: Create                                                       
                  <VOLUME_PREFIX>_pvc_d342bb73_0e83_                   
                  4530_9f80_8bfef58fed2b                                       
11/15 08:55:17 <FSX_ID>                                       
                              <SVM>                               
                                         Clone Create         Running     1452 
     Description: Create                                                       
                  <VOLUME_PREFIX>_pvc_d342bb73_0e83_                   
                  4530_9f80_8bfef58fed2b  

Note how Job 1452 is interleaved with the beginning of job 1453 - both requested from Trident. 1452 then succeeds, and 1453 (the one Trident is now presumably tracking) fails due to a duplicate key.

The way out of this situation appears to be manually deleting all resources on the cluster (editing out finalizers etc.) as well as the created resources in AmazonFSx.

Environment Provide accurate information about the environment to help us reproduce the issue.

To Reproduce Steps to reproduce the behaviour:

The following is completed by automatically by a custom operator:

Expected behaviour A clear and concise description of what you expected to happen.

The clone should be created and bound to the PVC as normal.

Additional context Add any other context about the problem here.

This problem does not occur all the time, but seems to be most of the time it fails in this way. It was originally occurring on 23.01.0 originally, but is also still happening on latest 23.10.0 after upgrading.

balaramesh commented 5 months ago

Hello @RuairiSinkler

Did you happen to run into this issue again at all?

Thank you for the detailed description. It does look like Trident expected a response from FSxN, but it never arrived on time. This makes me feel like it could be an isolated instance, but I thought I'd ask you anyways...

RuairiSinkler commented 4 months ago

Hi @balaramesh - yes we hit this issue repeatedly, around 50% of the time I would say and were unable to find a solution. It (amongst other issues) has caused us to have to abandon Trident as our NetApp CSI solution.