drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License

Support SDK-specific retry parameters #405

Open jsungg opened 8 months ago

jsungg commented 8 months ago

Hi there. I tried looking in the docs to see if there is any support for explicitly setting the number of retries on the Google Storage Client when writing to a cloud bucket through a CloudPath object. Specifically, my use case reads from and writes to a GCS bucket many times, and I sometimes get a 503 error on PUT requests; our infra wants to retry more frequently than the default GCS retry policy does.

It looks like this problem has come up before, in this issue: https://github.com/drivendataorg/cloudpathlib/issues/267

pjbull commented 8 months ago

It looks like the GCS retry functionality is per-method (rather than being set at the client level). We just took a PR with a similar structure for chunked downloads, so I could see implementing something like that: we could accept a ConditionalRetryPolicy object as a retry_policy kwarg on GSClient and use it to override the default if it is set.
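
For concreteness, here is a minimal sketch of what that could look like using the retry objects the google-cloud-storage SDK already provides; the retry_policy kwarg on GSClient is the hypothetical API proposed above, not something cloudpathlib exposes today.

```python
# Sketch only: builds a ConditionalRetryPolicy with a more aggressive
# schedule than the SDK default. The GSClient(retry_policy=...) call at
# the end is the *proposed* API from this thread, not an existing one.
from google.api_core.retry import Retry
from google.cloud.storage.retry import ConditionalRetryPolicy, is_generation_specified

# Retry transient errors for up to 120 seconds, backing off exponentially.
aggressive_retry = Retry(initial=1.0, maximum=30.0, multiplier=2.0, timeout=120.0)

# Only retry mutating calls when it is safe (generation specified), mirroring
# how the SDK's own DEFAULT_RETRY_IF_GENERATION_SPECIFIED is constructed.
policy = ConditionalRetryPolicy(aggressive_retry, is_generation_specified, ["query_params"])

# The SDK already accepts this per method, which is what cloudpathlib
# would forward under the hood, e.g.:
#   bucket.blob("key").upload_from_filename("local.txt", retry=policy)

# Hypothetical, per the proposal above:
#   client = GSClient(retry_policy=policy)
```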

The other option would be to implement retries in your own code with a library like tenacity, wrapping whatever functions use CloudPaths to do reading/writing.
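
For example, a minimal sketch of that workaround, assuming the 503 surfaces as google.api_core's ServiceUnavailable exception (adjust the exception type to whatever you actually observe):

```python
# Client-side retry wrapper around CloudPath writes using tenacity.
from google.api_core.exceptions import ServiceUnavailable
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from cloudpathlib import CloudPath


@retry(
    retry=retry_if_exception_type(ServiceUnavailable),  # GCS 503s
    wait=wait_exponential(multiplier=1, max=30),        # 1s, 2s, 4s, ... capped at 30s
    stop=stop_after_attempt(5),
)
def write_with_retry(path: CloudPath, data: str) -> None:
    path.write_text(data)


# Hypothetical bucket/key for illustration:
write_with_retry(CloudPath("gs://my-bucket/output.txt"), "hello")
```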

pjbull commented 1 week ago

This came up in #477 for S3 as well.

As @jayqi noted, we likely won't implement an independent retry mechanism in cloudpathlib. We recommend tenacity or similar.

We would, however, support passing retry parameters through to the provider SDKs, as discussed here. In addition to GCS, I think we could similarly support custom retries using boto's retry functionality for S3. In fact, on S3 the version that uses the AWS configuration file may just work as is.
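
A quick sketch of that configuration-file route: botocore reads retry settings from ~/.aws/config or from environment variables, so in principle no cloudpathlib changes are needed; whether every cloudpathlib operation actually inherits them is what would need verifying.

```python
# Botocore's documented retry settings, set here via environment variables
# (the equivalent ~/.aws/config entry is shown below). Any boto3 client
# created afterward, including the one inside cloudpathlib's S3Client,
# should inherit them.
#
#   [default]
#   retry_mode = adaptive
#   max_attempts = 10
import os

os.environ["AWS_RETRY_MODE"] = "adaptive"  # "legacy", "standard", or "adaptive"
os.environ["AWS_MAX_ATTEMPTS"] = "10"

from cloudpathlib import CloudPath

# Hypothetical bucket/key for illustration:
path = CloudPath("s3://my-bucket/data.csv")
path.write_text("hello")
```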