aws-solutions-library-samples / osml-model-runner

MIT No Attribution

fix: change put_object to upload_file to improve performance #85

Closed RanbirAulakh closed 2 months ago

RanbirAulakh commented 3 months ago

Issue #, if available: n/a

Notes

    Event request-created.s3.CreateMultipartUpload: calling handler <function signal_transferring at 0x7f9aa3e6d940>

    Submitting task CreateMultipartUploadTask(transfer_id=0, {'bucket': '****, 'key': 'test-d9d3d3869d6cc5bf380288249401cd17/d9d3d3869d6cc5bf380288249401cd17.geojson', 'extra_args': {'ACL': 'bucket-owner-full-control'}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f9a98a853d0> for transfer request: 0.

    Making request for OperationModel(name=CompleteMultipartUpload) with params: .... other details .... 'MultipartUpload': {'Parts': [{'ETag': '"41ee040f629a6b7d613cde4453ef5694"', 'PartNumber': 1}]}}}, 'S3Express': {'bucket_name': 'mr-bucket-sync-825536440648'}, 'signing': {'region': 'us-west-2', 'signing_name': 's3', 'disableDoubleEncoding': True}, 'endpoint_properties': {'authSchemes': [{'disableDoubleEncoding': True, 'name': 'sigv4', 'signingName': 's3', 'signingRegion': 'us-west-2'}]}}}

Default S3 Transfer Configuration:

    default: TransferConfig = TransferConfig(
        multipart_threshold=128 * 1024**2,  # 128 MB
        max_concurrency=10,
        multipart_chunksize=256 * 1024**2,  # 256 MB
        use_threads=True,
    )
  1. multipart_threshold: If a file exceeds this threshold (here, 128 MB), it is uploaded with the multipart upload method, which splits the file into smaller parts and uploads them in parallel. Multipart uploads are more efficient for large files because parts can be uploaded simultaneously and a failed part can be retried without restarting the whole upload.

  2. multipart_chunksize: Each part uploaded in parallel will be 256 MB in size, matching the configuration above. Larger chunk sizes mean fewer parts, reducing the overhead of managing many small parts. However, larger parts take longer to upload individually, and if a part fails, more data must be re-uploaded.
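To make the threshold/chunksize trade-off concrete, here is a small stdlib-only sketch of how many parts a managed transfer would create for each payload size tested below. The part-count arithmetic assumes boto3's documented semantics (multipart kicks in at the threshold; parts are `multipart_chunksize` each); no AWS calls are made.

```python
import math

# Values from the TransferConfig above.
MULTIPART_THRESHOLD = 128 * 1024**2  # 128 MB
MULTIPART_CHUNKSIZE = 256 * 1024**2  # 256 MB

def describe_upload(size_bytes: int) -> tuple[bool, int]:
    """Return (uses_multipart, part_count) for a payload of size_bytes."""
    if size_bytes < MULTIPART_THRESHOLD:
        return False, 1  # small payload: a single PutObject-style request
    return True, math.ceil(size_bytes / MULTIPART_CHUNKSIZE)

# Payload sizes from the benchmarks below.
for label, size in [("225 MB", 225 * 1024**2),
                    ("2.20 GB", int(2.20 * 1024**3)),
                    ("4.39 GB", int(4.39 * 1024**3))]:
    multipart, parts = describe_upload(size)
    print(f"{label}: multipart={multipart}, parts={parts}")
```

Note that a 225 MB payload crosses the 128 MB threshold but still fits in one 256 MB chunk, which is consistent with the single `'PartNumber': 1` entry in the CompleteMultipartUpload log above.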

Performance

200 MB Test

| Metric | Local Machine | SM Notebook (m5.2xlarge) |
| --- | --- | --- |
| JSON Data Created & Stored | 225.00 MB | 225.00 MB |
| Time to Calculate JSON Size | 1.20 seconds | 2.68 seconds |
| `put_object` Upload Time | 8.25 seconds | 4.42 seconds |
| `upload_file` Upload Time | 7.50 seconds | 4.69 seconds |

2 GB Test

| Metric | Local Machine | SM Notebook (m5.2xlarge) |
| --- | --- | --- |
| JSON Data Created & Stored | 2.20 GB | 2.20 GB |
| Time to Calculate JSON Size | 11.88 seconds | 26.95 seconds |
| `put_object` Upload Time | 75.52 seconds | 52.81 seconds |
| `upload_file` Upload Time | 26.66 seconds | 35.83 seconds |

4 GB Test

| Metric | Local Machine | SageMaker |
| --- | --- | --- |
| JSON Data Created & Stored | 4.39 GB | 4.39 GB |
| Time to Calculate JSON Size | 23.96 seconds | 49.21 seconds |
| `put_object` Upload Time | 149.28 seconds | 94.73 seconds |
| `upload_file` Upload Time | 68.46 seconds | 69.90 seconds |
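Converting the local-machine rows of the tables above into effective throughput makes the relative gain easier to see. This is plain arithmetic on the reported numbers; no assumptions beyond the tables.

```python
# (payload size in MB, put_object seconds, upload_file seconds),
# taken from the local-machine columns above.
results = {
    "200 MB test": (225.00, 8.25, 7.50),
    "2 GB test":   (2.20 * 1024, 75.52, 26.66),
    "4 GB test":   (4.39 * 1024, 149.28, 68.46),
}

for label, (size_mb, put_s, upload_s) in results.items():
    put_tp = size_mb / put_s        # put_object throughput, MB/s
    up_tp = size_mb / upload_s      # upload_file throughput, MB/s
    print(f"{label}: put_object {put_tp:.1f} MB/s, "
          f"upload_file {up_tp:.1f} MB/s ({up_tp / put_tp:.1f}x)")
```

At 200 MB the two methods are roughly even (only one part is transferred either way), but at 2 GB and 4 GB, where multiple parts upload in parallel, `upload_file` delivers roughly 2-3x the throughput of `put_object` locally.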

Checklist

Before you submit a pull request, please make sure you have the following:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.