NetApp / trident

Storage orchestrator for containers

GCP Cloud Volumes: Failed to read 'gcp-cvs driver' version #917

Closed lorenzboguhn closed 1 week ago

lorenzboguhn commented 1 month ago

Describe the bug

I am trying to provision (NFS) RWX PVCs using the NetApp Astra Trident Operator via the Cloud Volumes Service for GCP. I created:

This results in a failed status for the TridentBackendConfig and the following error message:

message: 'Failed to create backend: problem initializing storage driver ''gcp-cvs'': error validating gcp-cvs driver. failed to read version' 

Moreover, PVCs are stuck in the Pending state and no PVs are created.

TridentBackendConfig:

{
  "apiVersion": "trident.netapp.io/v1",
  "kind": "TridentBackendConfig",
  "metadata": {
    "name": "basic-nfs",
    "namespace": "nfs"
  },
  "spec": {
    "version": 1,
    "storageClass": "software",
    "serviceLevel": "standard",
    "storageDriverName": "gcp-cvs",
    "projectNumber": "XXXXXXXXX",
    "apiRegion": "eu-west3",
    "apiKey": {
      "type": "service_account",
      "project_id": "XXXXXXXXXXXXX",
      "client_email": "astra-trident-XXXXX@XXXXXXXXXXXXX.iam.gserviceaccount.com",
      "client_id": "XXXXXXXXXXXXXXX",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/astra-trident-XXXXXXXXXXXXXX.iam.gserviceaccount.com"
    }
  }
}

TridentBackendConfig status:

status:                                                                                  
  backendInfo:                                                                               
    backendName: ""                                                                    
    backendUUID: ""                                                               
  deletionPolicy: delete                                                     
  lastOperationStatus: Failed
  message: 'Failed to create backend: problem initializing storage driver ''gcp-cvs'':
    error validating gcp-cvs driver. failed to read version'                                           
  phase: ""                                                                                        

trident-main container logs:

trident-main time="2024-07-29T12:20:30Z" level=info msg=-------------------------------------------------                                                                                                   
trident-main time="2024-07-29T12:20:30Z" level=info msg=-------------------------------------------------                                                                                                   
trident-main time="2024-07-29T12:23:10Z" level=error msg="Could not initialize storage driver." crdControllerEvent=add error="error validating gcp-cvs driver. failed to read version" logLayer=core requestID=a4f031d5-b7b9-4c04-a8ae-45dbe98ec799 requestSource=CRD workflow="cr=reconcile"
trident-main time="2024-07-29T12:23:10Z" level=warning msg="Cannot terminate an uninitialized backend." backend=basic-nfs-2 backendUUID=d945fffe-620c-4ee4-a3a1-60838c26c269 crdControllerEvent=add driver=gcp-cvs logLayer=core requestID=a4f031d5-b7b9-4c04-a8ae-45dbe98ec799 requestSource=CRD state=failed workflow="cr=reconcile"

Digging through your code I found the source of the error log: https://github.com/NetApp/trident/blob/722e7ef9e58b56fa5815af10c8794b0097ac8b9c/storage_drivers/gcp/api/gcp.go#L277

Moreover, I found that the default API URL is hardcoded in the operator to cloudvolumesgcp-api.netapp.com, and the /version endpoint does not seem to be implemented in the new API. Digging further through the code, I found that the API URL is configurable, but the /version path is still hardcoded. Using curl against the old API, I get 403 responses on /version, indicating that the endpoint exists. Against the new API, I get a 404 response on /version, indicating that the endpoint does not exist.

Old API URL (with deprecation note): cloudvolumesgcp-api.netapp.com https://console.cloud.google.com/apis/library/cloudvolumesgcp-api.netapp.com

New API URL: netapp.googleapis.com https://console.cloud.google.com/apis/library/netapp.googleapis.com
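The 403-versus-404 reasoning above can be sketched in a few lines of Python. This is only an illustration of how the status codes were interpreted, not how Trident itself validates the backend. The function names `interpret_status` and `probe` are hypothetical, and the full request path is left as a placeholder because it is not spelled out in this report. Note that a 404 only shows that the probed path is not served; it cannot distinguish a missing endpoint from a mistyped path.

```python
import urllib.error
import urllib.request


def interpret_status(status: int) -> str:
    """Map an HTTP status code from an unauthenticated probe to a rough conclusion."""
    if status == 404:
        return "endpoint not found"
    if status in (401, 403):
        # The server recognised the path but refused the unauthenticated request.
        return "endpoint exists, request not authorized"
    if 200 <= status < 300:
        return "endpoint exists and responded"
    return "inconclusive"


def probe(url: str, timeout: float = 10.0) -> int:
    """Send an unauthenticated GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code


# The two observations from this report, interpreted without any network access:
print("old API:", interpret_status(403))
print("new API:", interpret_status(404))
```

To repeat the check against a live host, one would call `interpret_status(probe(url))` with the complete URL of the version endpoint filled in.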

Environment

The ampersands && are used to show the different versions I tried to use.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

The TridentBackendConfig becomes ready and one can provision RWX PVCs using the NetApp Astra Trident Operator via the Cloud Volumes Service for GCP.

Additional context

Digging through your code, I found that the API URL can be overridden, but the /version endpoint nevertheless does not seem to be implemented in the new API. At least, I was able to curl the old one and get a 403, but I only get a 404 for the new API URL with the complete path.

After unsuccessful communication with Google support, I created a basic VM with the default COS image used by Google workers and found that nfs-common is already installed in these images.

Furthermore, your docs state that the nfs-common package is required on the nodes (https://docs.netapp.com/us-en/trident/trident-use/worker-node-prep.html#nfs-volumes). I have therefore made sure that I do not need to install the NFS tools myself on managed GKE, as they are already installed in the GKE images:

I would appreciate any help or advice on how to solve the problem. Best Regards, Lorenz

clintonk commented 1 month ago

Hello, @lorenzboguhn. If you are trying to use the API netapp.googleapis.com, that refers to the new 1st-party Google Cloud NetApp Volumes (GCNV) service, not the older 3rd-party Cloud Volumes Service (CVS), which is deprecated and is being fully replaced by GCNV this year. Trident 24.06 added a tech preview driver for GCNV, and full support is expected in 24.10. The Trident docs for GCNV should be published soon, but as you are comfortable reading the code, I expect you can kick the tires starting with a minimal example config file:

{
    "version": 1,
    "storageDriverName": "google-cloud-netapp-volumes",
    "projectNumber": "XXXXXXXXXXXX",
    "location": "europe-west6",
    "serviceLevel": "premium",
    "apiKey": { <GCP service account JSON, same as CVS used> }
}

lorenzboguhn commented 3 weeks ago

Hello @clintonk, first of all, thank you for your response and the explanation. This makes sense to me now. I have tried the google-cloud-netapp-volumes driver, but unfortunately I am not able to create a backend. I now get the following log message:

Warning  Failed  13m (x25 over 135m)  trident-crd-controller  Failed to create backend: problem initializing storage driver 'google-cloud-netapp-volumes': error validating google-cloud-netapp-volumes GCNV API. rpc error: code = InvalidArgument desc = Request contains an invalid argument.; no GCNV storage pools found for Trident pool basic-nfs_pool

I have tried to set the pool via the "storagePools": ["test-pool"] setting, but this was ignored. Additionally, I have tried to create a storage pool named basic-nfs_pool, but Google Cloud rejected it with the following error:

storage pool „basic-nfs_pool“ creation invalid request error: "resourceId in body should match '^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$'". 

This indicates that underscores (_) are forbidden in the resource name, and I am not sure how to configure the resourceId of the storage pool in the Trident backend config file. I will try to check that tomorrow and see if I can get it to work.
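As a quick illustration, the pattern from the error message can be checked locally before attempting to create a pool. This is a minimal sketch: the regular expression is copied verbatim from the error above, while the helper name `is_valid_resource_id` and the candidate name `basic-nfs-pool` are hypothetical and only serve the example.

```python
import re

# Pattern copied verbatim from the Google Cloud error message quoted above.
RESOURCE_ID_PATTERN = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")


def is_valid_resource_id(name: str) -> bool:
    """Return True if `name` satisfies the resourceId pattern."""
    return RESOURCE_ID_PATTERN.fullmatch(name) is not None


# "basic-nfs_pool" and "test-pool" appear in this thread;
# "basic-nfs-pool" is a hypothetical alternative without an underscore.
for candidate in ("basic-nfs_pool", "test-pool", "basic-nfs-pool"):
    verdict = "valid" if is_valid_resource_id(candidate) else "rejected"
    print(f"{candidate}: {verdict}")
```

The pattern allows only lowercase letters, digits, and hyphens, must start with a letter, must not end with a hyphen, and caps the length at 63 characters, so any name containing an underscore fails.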

Again thank you for your help. Lorenz

clintonk commented 3 weeks ago

@lorenzboguhn GCNV docs were just published. Hope that helps!

https://docs.netapp.com/us-en/trident/trident-use/gcnv.html
https://docs.netapp.com/us-en/trident/trident-use/gcnv-prep.html
https://docs.netapp.com/us-en/trident/trident-use/gcnv-examples.html

clintonk commented 1 week ago

@lorenzboguhn I hope you have GCNV working by this time. If not, please let us know.

lorenzboguhn commented 1 week ago

@clintonk the issue is now a different one. For GCNV, the operator tries to create or find a storage pool, which cannot be created because of the _ in its name. Underscores are unfortunately forbidden in Google Cloud resource names.