GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage
https://cloud.google.com/storage/docs/gcs-fuse
Apache License 2.0

Facing 429 Rate limit error when connecting more than 1000 pods #2453

Closed: murtaza98 closed this issue 1 month ago

murtaza98 commented 2 months ago

Describe the issue

When trying to mount a GCS bucket to more than 1000 pods within a GKE cluster using the FUSE driver, we get 429 errors in the fuse sidecar container, which leaves pods stuck in the ContainerCreateError state. I noticed a similar error was reported previously, and per the suggestions there we are running a GKE version newer than 1.29.3-gke.1093000.

(a) Do you see this as part of the mount failure, or does it come after a successful mount, during read operations? Yes, this is part of the mount failures.
(b) Do you see the failure in all the pods or in a few pods? Not all pods are affected, but a considerable number of pods are affected once you scale beyond 500 pods.
(c) If possible, could you please provide the gcsfuse debug logs of any failing pods? Please find the logs below.

System & Version (please complete the following information):

Additional context

Logs:

INFO 2024-09-05T11:55:13.048122426Z gcsfuse config file content: map[cache-dir: file-cache:map[max-size-mb:0] logging:map[file-path:/dev/fd/1 format:json severity:warning] metadata-cache:map[ttl-secs:600]]
INFO 2024-09-05T11:55:13.048136126Z start to mount bucket "coderunner-compile-binary-staging" for volume "root-submission-pv"
INFO 2024-09-05T11:55:13.048143566Z gcsfuse mounting with args [--file-mode 664 --http-client-timeout 30s --max-idle-conns-per-host 500 --app-name gke-gcs-fuse-csi --temp-dir /gcsfuse-buffer/.volumes/root-submission-pv/temp-dir --experimental-metadata-prefetch-on-mount disabled --config-file /gcsfuse-tmp/.volumes/root-submission-pv/config.yaml --dir-mode 775 --implicit-dirs --foreground --uid 0 --gid 2000 coderunner-compile-binary-staging /dev/fd/3]...
INFO 2024-09-05T11:55:13.048149986Z waiting for SIGTERM signal...
INFO 2024-09-05T11:55:13.049156816Z gcsfuse for bucket "coderunner-compile-binary-staging", volume "root-submission-pv" started with process id 29
ERROR 2024-09-05T11:55:13.051365898Z Error while mounting gcsfuse: mountWithStorageHandle: fs.NewServer: create file system: SetUpBucket: Error in iterating through objects: Get "https://storage.googleapis.com/storage/v1/b/coderunner-testcases-staging/o?alt=json&delimiter=&endOffset=&includeFoldersAsPrefixes=false&includeTrailingDelimiter=false&matchGlob=&maxResults=1&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false": compute: Received 429 `429 Too Many Requests: {"error":"quota_exceeded","error_description":"[Security Token Service] The request was throttled due to rate limit. Please retry after a few seconds."} `
ERROR 2024-09-05T11:55:13.052525018Z mountWithStorageHandle: fs.NewServer: create file system: SetUpBucket: Error in iterating through objects: Get "https://storage.googleapis.com/storage/v1/b/coderunner-testcases-staging/o?alt=json&delimiter=&endOffset=&includeFoldersAsPrefixes=false&includeTrailingDelimiter=false&matchGlob=&maxResults=1&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false": compute: Received 429 `429 Too Many Requests: {"error":"quota_exceeded","error_description":"[Security Token Service] The request was throttled due to rate limit. Please retry after a few seconds."}
ERROR 2024-09-05T11:55:13.052547328Z `
ERROR 2024-09-05T11:55:13.054356188Z gcsfuse exited with error: exit status 1
INFO 2024-09-05T11:55:13.060454794Z connecting to socket "/gcsfuse-tmp/.volumes/root-submission-pv/socket"
INFO 2024-09-05T11:55:13.061002704Z get the underlying socket
INFO 2024-09-05T11:55:13.061019324Z calling recvmsg...
INFO 2024-09-05T11:55:13.061425914Z parsing SCM...
INFO 2024-09-05T11:55:13.061462514Z parsing SCM_RIGHTS...

Persistent Volume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: consumer-storage-pv
  namespace: {{ .Values.namespace }}
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 200Gi
  storageClassName: consumer-storage-pv
  mountOptions:
    - implicit-dirs
    - uid=1001
    - gid=2000
    - experimental-metadata-prefetch-on-mount=disabled
    - max-idle-conns-per-host=500
    - http-client-timeout=30s
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: coderunner-testcases-staging
    volumeAttributes:
      gcsfuseLoggingSeverity: warning
      fileCacheCapacity: "10Gi"
      metadataCacheTTLSeconds: "600"
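
For context, a PV like this is bound through a PVC and consumed by pods carrying the GKE GCS FUSE sidecar annotation. Below is a minimal sketch of that wiring; the PVC and pod names are illustrative assumptions, not taken from this cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: consumer-storage-pvc          # illustrative name
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  volumeName: consumer-storage-pv     # binds to the PV above
  storageClassName: consumer-storage-pv
---
apiVersion: v1
kind: Pod
metadata:
  name: consumer-pod                  # illustrative name
  annotations:
    gke-gcs-fuse/volumes: "true"      # asks GKE to inject the gcsfuse sidecar
spec:
  # serviceAccountName with the Workload Identity binding omitted for brevity
  containers:
  - name: app
    image: busybox
    volumeMounts:
    - name: gcs-volume
      mountPath: /data
  volumes:
  - name: gcs-volume
    persistentVolumeClaim:
      claimName: consumer-storage-pvc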

Traffic on Bucket:

[image]
gargnitingoogle commented 2 months ago

@murtaza98 Thanks for reporting this issue. I am looking into it.

gargnitingoogle commented 2 months ago

The key error is {"error":"quota_exceeded","error_description":"[Security Token Service] The request was throttled due to rate limit. Please retry after a few seconds."}. This indicates that you have insufficient quota for the Security Token Service. You need to increase the Security Token Service quota in your GCP project.

Please find more details at https://cloud.google.com/iam/quotas#quotas. Let me know if you have any further issues.

murtaza98 commented 2 months ago

Thanks for looking into this issue, @gargnitingoogle. We'll work on increasing this quota and run another test next week. Will keep you posted on the findings.

gargnitingoogle commented 2 months ago

Wanted to add here that the STS quota for a project is shared across all concurrent GKE workloads and GCSFuse mounts in that project. As a rule of thumb, if your clusters will perform at most N concurrent GCSFuse mounts (across all pods and clusters), then you should set your STS quota to more than 2N requests per minute.
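
As a worked example of that rule of thumb (hypothetical numbers, not figures from this thread): 5000 pods each mounting 2 buckets gives N = 10000 concurrent mounts, so the STS quota should be set above 2N = 20000 requests per minute.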

Tulsishah commented 1 month ago

Hey @murtaza98,

Has your issue been resolved? To help us debug the issue better, could you please provide the following details about the project:

  1. How many GKE clusters are there, and what are their versions?
  2. The exact cluster details where you're seeing the mount failures due to quota issues.
  3. Pod and volume information. If you prefer not to share this information publicly, you can file a support bug.

Let us know if you have any other questions!

Thanks, Tulsi Shah

murtaza98 commented 1 month ago

Hello all, thanks for your help with this issue. The above suggestion of increasing the rate limit on the STS service did indeed help us scale beyond 1000 pods. We requested an increase of the STS request quota to 60k per minute (1k per second). With that limit, we were able to bring up about 10k pods (approx. 1800 nodes), with each pod mounting at most 2 storage buckets. At this stage the setup was reasonably stable, though we saw a slight increase in read and write times against the Google storage mount.
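
(For reference, this matches the 2N rule of thumb above: 10k pods with up to 2 mounts each is roughly N = 20k concurrent mounts, so the rule calls for a quota above 40k requests per minute, which the 60k quota satisfies.)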

As we scaled further, we encountered the STS request quota limit again, which impacted our ability to bring up new pods. While this is something we could address by contacting support to increase the quota, it wasn’t our primary concern.

The more critical issue we observed was a significant degradation in the performance of read and write operations on the Google storage mount. Reads and writes were now taking three times longer than before. Since there was no indication of throttling on the storage bucket due to rate limits, this suggests that the FUSE driver itself was not scaling effectively.

Based on the results, we’ve decided to pause the POC with GCS Fuse for now. We achieved better performance by manually handling file uploads and downloads within our application using storage APIs, which proved more scalable for our needs.

As feedback, I'd like to ask whether there are any guidelines for mounting GCS Fuse on pods at large scale, specifically at least 50k pods.

Additionally, I'd like to better understand why the FUSE driver generates so many STS calls. I understand that a pod makes an authorization call on its initial spin-up, which would lead to STS requests when new pods are added to a cluster, but in our tests we observed continuous API calls to STS even without scaling up, simply during read and write operations on the bucket. My current understanding is that STS is primarily for authorization, so why does the driver invoke it during read/write operations?

[image]
murtaza98 commented 1 month ago

@Tulsishah Answering your questions

How many GKE clusters are there, and what are their versions?

One GKE cluster, version 1.29.7-gke.1104000.

The exact cluster details where you're seeing the mount failures due to quota issues.

Please feel free to contact me by email if you require this info for further debugging: murtaza@hackerrank.com

Pod and volume information.

[image]

We had 2 buckets. The 1st bucket had file caching enabled: per-pod ephemeral storage was set to 2Gi and fileCacheCapacity was set to 1Gi.

[image]

The 2nd bucket had no file caching, with ephemeral storage set to 2Gi.

[image]

Note: we were running approx. 6 pods per node, and we always had more than 20Gi free per node.

songjiaxun commented 1 month ago

Hi @murtaza98, to potentially avoid the STS quota issue, could you try adding a new volume attribute to your PV spec? It makes the CSI driver skip unnecessary STS requests. We've fixed this in a newer GKE version so the explicit volume attribute is no longer required, but you don't have to upgrade your cluster for this POC; setting the volume attribute achieves the same result. Here is an example:

apiVersion: v1
kind: PersistentVolume
spec:
  ...
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: coderunner-testcases-staging
    volumeAttributes:
      gcsfuseLoggingSeverity: warning
      fileCacheCapacity: "10Gi"
      metadataCacheTTLSeconds: "600"
      skipCSIBucketAccessCheck: "true"

The last line skipCSIBucketAccessCheck: "true" is all you need to add.
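
If you mount buckets as CSI ephemeral volumes rather than through a PV, the same attribute should work inline in the pod spec as well; a sketch under that assumption, with an illustrative pod name and bucket name:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod                  # illustrative name
  annotations:
    gke-gcs-fuse/volumes: "true"     # enables gcsfuse sidecar injection
spec:
  containers:
  - name: app
    image: busybox
    volumeMounts:
    - name: gcs-fuse-csi-ephemeral
      mountPath: /data
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: example-bucket           # illustrative bucket
        skipCSIBucketAccessCheck: "true"     # skips the driver's access check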

sethiay commented 1 month ago

Hey @murtaza98, just wanted to check whether skipCSIBucketAccessCheck: "true", as suggested by @songjiaxun, helped reduce the STS usage and the quota issue for you?

murtaza98 commented 1 month ago

Hello @sethiay

Sorry for the delayed response!

Unfortunately, we did not get a chance to run another load test with the above-requested change. As mentioned above, we saw better results at our scale by using the Cloud Storage API directly in our application to manage file uploads and downloads, so we're moving forward with that approach.

I'm closing this issue now, as I cannot validate the above change.