GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Configure modified retry for uploading partition yamls to GCS #1292

Open karcot1 opened 1 week ago

karcot1 commented 1 week ago

When uploading a large number of YAMLs (10,000) to GCS we are encountering a 503 error. This indicates the connection to GCS was interrupted, which is commonly encountered when uploading a large number of files via Python.

The exact error message: `'Request failed with status code', 503, 'Expected one of', <HTTPStatus.OK: 200>`

The standard recommendation for 5xx errors is to retry, and there is sample code to use as a reference: https://github.com/googleapis/python-storage/blob/main/samples/snippets/storage_configure_retries.py#L55

The request is to please modify `_write_gcs_file` in `gcs_helper.py` to use retries.

Suggestion (based on above sample code):

```python
from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY


def _write_gcs_file(file_path: str, data: str):
    gcs_bucket = get_gcs_bucket(file_path)
    blob = gcs_bucket.blob(_get_gcs_file_path(file_path))

    # Retry transient (5xx) failures for up to 500s total, with exponential
    # backoff starting at 1.5s and capped at 45s between attempts.
    modified_retry = DEFAULT_RETRY.with_deadline(500.0)
    modified_retry = modified_retry.with_delay(initial=1.5, multiplier=1.2, maximum=45.0)

    blob.upload_from_string(data, retry=modified_retry)
```

Ideally, the values (deadline, initial, multiplier, maximum) could be parameterized so that end users can tune them for optimal performance; a rough sketch of one way to do that follows.
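For example (a sketch only; the environment variable names below are hypothetical, not existing DVT options), the values could be read from the environment with the numbers above as defaults:

```python
# Hypothetical sketch: build the modified retry from optional environment
# overrides. Variable names are illustrative, not existing DVT settings.
import os

from google.cloud.storage.retry import DEFAULT_RETRY


def _get_modified_retry():
    """Return a Retry configured from env vars, falling back to defaults."""
    deadline = float(os.environ.get("DVT_GCS_RETRY_DEADLINE", "500.0"))
    initial = float(os.environ.get("DVT_GCS_RETRY_INITIAL", "1.5"))
    multiplier = float(os.environ.get("DVT_GCS_RETRY_MULTIPLIER", "1.2"))
    maximum = float(os.environ.get("DVT_GCS_RETRY_MAXIMUM", "45.0"))
    return DEFAULT_RETRY.with_deadline(deadline).with_delay(
        initial=initial, multiplier=multiplier, maximum=maximum
    )
```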

sundar-mudupalli-work commented 1 week ago

Hi,

Review of the customer usage showed that the customer had up to 8 separate generate-partitions executions running in parallel. Given how long it takes to write to GCS (about 4 hours for 10,000 objects) and that each DVT execution is single-threaded, it is unclear how this raised 503 errors. DVT creates a client for each file, as outlined below, which could exacerbate the situation.

When writing a large number of GCS files for partitions, it can take about 1.25 seconds to write each file, which is awfully slow. The problem is in the get_gcs_bucket function, which creates an HTTP client and a connection to the bucket every time it is called, so with 10,000 files this adds up to a very long time. However, if the return value from get_gcs_bucket is cached, writes of small files to GCS improve dramatically (> 20x).
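As a rough illustration of the caching idea (a sketch only; the module-level cache below is not existing DVT code, and the real get_gcs_bucket takes a gs:// path rather than a bucket name):

```python
# Sketch: keep one client and one Bucket handle per bucket name so the HTTP
# client and bucket lookup happen once, not once per partition file.
from google.cloud import storage

_STORAGE_CLIENT = None
_BUCKET_CACHE = {}


def _get_cached_bucket(bucket_name: str) -> storage.Bucket:
    """Return a cached Bucket handle, creating the client only on first use."""
    global _STORAGE_CLIENT
    if _STORAGE_CLIENT is None:
        _STORAGE_CLIENT = storage.Client()
    if bucket_name not in _BUCKET_CACHE:
        _BUCKET_CACHE[bucket_name] = _STORAGE_CLIENT.get_bucket(bucket_name)
    return _BUCKET_CACHE[bucket_name]
```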

It may make sense to restructure the gcs_helper package as a class, perhaps renamed to filesys_helper, which would allow DVT to interact with Cloud Storage or the local file system as specified by the user. If the customer specifies a GCS destination (or source), we can save the storage_client.get_bucket('bucket_name') result in the class so that it can be reused. This would significantly improve performance.
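A minimal sketch of that class-based shape (class and method names are placeholders for discussion, not an agreed API):

```python
# Sketch: a helper object that holds one reusable Bucket handle for GCS
# destinations and falls back to the local file system otherwise.
from google.cloud import storage


class FileSysHelper:
    def __init__(self, bucket_name: str = None):
        self._bucket = None
        if bucket_name:
            # Created once and reused for every partition file written.
            self._bucket = storage.Client().get_bucket(bucket_name)

    def write_file(self, path: str, data: str):
        if self._bucket:
            self._bucket.blob(path).upload_from_string(data)
        else:
            with open(path, "w") as f:
                f.write(data)
```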

Sundar Mudupalli