Uploading very large imagery can produce a detection objects GeoJSON file larger than 5 GB, and uploading GeoJSON of that size may fail because `s3.put_object` sends the entire body in a single PUT request, which S3 caps at 5 GB per object. Switching from `s3.put_object` to `s3.upload_file` resolves this, since `upload_file` goes through boto3's managed transfer layer and switches to multipart upload for large files. I have confirmed that multipart upload is being used, as shown in the DEBUG logging below:
"Event request-created.s3.CreateMultipartUpload: calling handler <function signal_transferring at 0x7f9aa3e6d940>"
"Submitting task CreateMultipartUploadTask(transfer_id=0, {'bucket': '****, 'key': 'test-d9d3d3869d6cc5bf380288249401cd17/d9d3d3869d6cc5bf380288249401cd17.geojson', 'extra_args': {'ACL': 'bucket-owner-full-control'}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f9a98a853d0> for transfer request: 0."
Making request for OperationModel(name=CompleteMultipartUpload) with params: .... other details .... 'MultipartUpload': {'Parts': [{'ETag': '"41ee040f629a6b7d613cde4453ef5694"', 'PartNumber': 1}]}}}, 'S3Express': {'bucket_name': 'mr-bucket-sync-825536440648'}, 'signing': {'region': 'us-west-2', 'signing_name': 's3', 'disableDoubleEncoding': True}, 'endpoint_properties': {'authSchemes': [{'disableDoubleEncoding': True, 'name': 'sigv4', 'signingName': 's3', 'signingRegion': 'us-west-2'}]}}}
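For reference, here is a minimal sketch of the switch. The `bucket`, `key`, `local_path`, and `geojson_bytes` names are placeholders rather than the actual variables in this change; the `ExtraArgs` ACL mirrors the `extra_args` visible in the log above:

```python
import boto3

s3 = boto3.client("s3")

# Before: put_object sends the entire body in a single PUT request,
# which S3 caps at 5 GB per object.
# s3.put_object(Bucket=bucket, Key=key, Body=geojson_bytes,
#               ACL="bucket-owner-full-control")

# After: upload_file goes through boto3's managed transfer layer, which
# automatically switches to multipart upload once the file size crosses
# the configured threshold.
s3.upload_file(
    Filename=local_path,  # placeholder path to the .geojson file on disk
    Bucket=bucket,
    Key=key,
    ExtraArgs={"ACL": "bucket-owner-full-control"},
)
```

One practical difference: `upload_file` takes a filename on disk, while `put_object` accepts an in-memory body, so the GeoJSON must be written to a local file first (or uploaded with `upload_fileobj` from a file-like object).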
Default S3 Transfer Configuration (a configuration sketch follows this list):

- `multipart_threshold`: If the file size exceeds this threshold (in this case, 128 MB), the file is uploaded using the multipart upload method, which splits the file into smaller parts and uploads them in parallel. Multipart uploads are more efficient for large files, allowing different parts to be uploaded simultaneously and failed parts to be retried without starting over.
- `multipart_chunksize`: Each part uploaded in parallel will be 128 MB in size. Larger chunk sizes mean fewer parts to upload, reducing the overhead of managing many small parts. However, larger parts take longer to upload individually, and if a part fails, more data must be re-uploaded.
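As a sketch, assuming the 128 MB values above are set explicitly through `boto3.s3.transfer.TransferConfig` (boto3's stock defaults for both settings are 8 MB), the configuration would be passed to `upload_file` like this:

```python
from boto3.s3.transfer import TransferConfig

MB = 1024 ** 2

# Assumed values matching the 128 MB figures described above;
# boto3's built-in defaults for both settings are 8 MB.
config = TransferConfig(
    multipart_threshold=128 * MB,  # switch to multipart upload above this size
    multipart_chunksize=128 * MB,  # size of each part uploaded in parallel
)

# Placeholder names as in the earlier sketch; Config carries the
# transfer settings into the managed upload.
s3.upload_file(
    Filename=local_path,
    Bucket=bucket,
    Key=key,
    ExtraArgs={"ACL": "bucket-owner-full-control"},
    Config=config,
)
```

At 128 MB per part, the 4.39 GB test file below splits into roughly 35 parts that can upload concurrently, while the 225 MB file needs only two.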
## Performance

### 200 MB Test

| Metric | Local Machine | SM Notebook (m5.2xlarge) |
| --- | --- | --- |
| JSON Data Created & Stored | 225.00 MB | 225.00 MB |
| Time to Calculate JSON Size | 1.20 seconds | 2.68 seconds |
| `put_object` Upload Time | 8.25 seconds | 4.42 seconds |
| `upload_file` Upload Time | 7.50 seconds | 4.69 seconds |

### 2 GB Test

| Metric | Local Machine | SM Notebook (m5.2xlarge) |
| --- | --- | --- |
| JSON Data Created & Stored | 2.20 GB | 2.20 GB |
| Time to Calculate JSON Size | 11.88 seconds | 26.95 seconds |
| `put_object` Upload Time | 75.52 seconds | 52.81 seconds |
| `upload_file` Upload Time | 26.66 seconds | 35.83 seconds |

### 4 GB Test

| Metric | Local Machine | SM Notebook (m5.2xlarge) |
| --- | --- | --- |
| JSON Data Created & Stored | 4.39 GB | 4.39 GB |
| Time to Calculate JSON Size | 23.96 seconds | 49.21 seconds |
| `put_object` Upload Time | 149.28 seconds | 94.73 seconds |
| `upload_file` Upload Time | 68.46 seconds | 69.90 seconds |
## Checklist

Before you submit a pull request, please make sure you have the following:

- [x] Code changes are compact and well-structured to facilitate easy review
- [x] Changes are documented in the README.md and other relevant documentation pages
- [x] PR title and description accurately reflect the changes and are detailed enough for historical tracking
- [x] PR contains tests that cover all new code and the code has been manually tested
- [x] All new dependencies are declared (if any), and no unnecessary libraries are added
- [x] Performance impacts (if any) of the changes are evaluated and documented
- [x] Security implications of the changes (if any) are reviewed and addressed
**Issue #, if available:** n/a
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.