As an aside, is it possible to turn off the auto_ranged_get behaviour in S3CrtClient and do a single GET?
What you are looking for is multipart_upload_threshold, but unfortunately it has not been exposed on the C++ SDK side yet. We can use this ticket to track exposing it in the CRT client config.
CRT uses part size as a hint for the buffer pooling allocator, which avoids allocating memory over and over again for buffers. By setting part size to 5GB you are blowing past the default buffer pool budget of 2GB, and it looks like we are missing a sanity check somewhere to error out in this case. You can increase the buffer budget by setting memory_limit_in_bytes, but that will use up a lot of memory if you are setting part sizes to 5GB. In general, setting the part size is not the recommended way to control whether splitting happens.
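To make that concrete, here is a minimal sketch of the configuration being discussed. partSize is a real Aws::S3Crt::ClientConfiguration field; memory_limit_in_bytes is a CRT-level (aws-c-s3) setting, and whether it is surfaced in the C++ config is not assumed here:

```cpp
#include <aws/s3-crt/S3CrtClient.h>

Aws::S3Crt::ClientConfiguration MakeConfig() {
    Aws::S3Crt::ClientConfiguration config;
    // partSize is a hint to the CRT buffer pooling allocator: buffers of this
    // size are pooled, so a 5 GB part blows past the default 2 GB pool budget
    // and currently aborts instead of returning an error.
    config.partSize = 5ULL * 1024ULL * 1024ULL * 1024ULL; // 5 GB
    // The CRT-level memory_limit_in_bytes budget mentioned above is not set
    // here; how (or whether) it is surfaced in the C++ config is not assumed.
    return config;
}
```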
On a side note, why do you want to disable MPU behavior? Using the regular low-level S3 client in this case might be an alternative, depending on what you want to achieve.
Thank you @DmitriyMusatkin for looking into this.
On a side note, why do you want to disable MPU behavior? Using the regular low-level S3 client in this case might be an alternative, depending on what you want to achieve.
I want to save on the number of PUT/GET API calls made, as every call has a cost associated with it. I know S3CrtClient is built to maximise throughput by doing MPU, but I'm OK to sacrifice that. I was thinking of still preferring to use S3CrtClient over S3Client because:
Thank you for creating a ticket to expose the multipart_upload_threshold config. Once exposed, setting that config alone should be enough to avoid S3CrtClient automatically doing MPU and ranged GETs, is that right?
Also, just to confirm: once I've set the multipart_upload_threshold to 5GB, for a PUT request I hope the client will not require bringing the entire request body into memory (consuming 5GB) in order to make the request? Rather, it'll happen in a streaming fashion based on whatever buffer size is chosen by the client?
As an additional request, it seems the low-level s3_client in aws-c-s3 also supports specifying the partSize on a per-request basis? Would it be possible to expose that as well through the upper layers? The use case could be that for certain kinds of objects in an S3 bucket one cares about throughput, but for others one doesn't.
@sbiscigl I do have this issue as well. Appreciate your help!
Thank you for creating a ticket to expose the multipart_upload_threshold config. Once exposed, setting that config alone should be enough to avoid S3CrtClient automatically doing MPU and ranged GETs, is that right?
MPUs yes, ranged GETs no, but that configuration is exposed now and will be tagged later. Researching the other questions.
I don't think that use case is something that's supported very well in CRT right now. CRT is built around the idea of having a bunch of data read into buffers and worked on in parallel. All data needs to be in the buffer before it can be sent, and CRT does not currently stream data from the source directly to S3 (there are various reasons for this, and it is something that can be improved). So in the case of a single-chunk upload, CRT will end up having to load the whole chunk into the buffer and then dispatch it to a single connection that sends it to S3. So unfortunately, with a 5GB threshold it will attempt to load all 5GB into memory. It's not the primary use case CRT was designed for, but something I think we should improve. I don't think there are any settings you can set to mitigate this in a graceful way. We probably should move this issue to aws-c-s3, as there is not much the C++ SDK can do here.
Thank you both for your detailed replies.
CRT is built around the idea of having a bunch of data read into buffers and worked on in parallel. All data needs to be in the buffer before it can be sent, and CRT
Understood, so it seems even after exposing multipart_upload_threshold, setting memoryLimitBytes or downloadMemoryUsageWindow won't have any effect and we will still require a buffer of 5GB in the worst case.
@DmitriyMusatkin I'd just like to confirm the same with S3Client (and not S3CrtClient), as I plan to switch to using it:
Additionally, a suggestion: it would be good to document these facts in the README, i.e. how S3CrtClient and S3Client compare in terms of their buffering requirements.
Yes, multipart_upload_threshold will allow you to force CRT to do only one request, but it will still need to allocate memory for the whole request. The windowing settings affect how many requests CRT runs in parallel so that it does not overwhelm the consumer, but they will not let you force the download to complete in one API call.
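Once that option ships in a tagged release, usage would presumably look something like the sketch below. The C++ field name multipartUploadThreshold is an assumption here; check the release notes for the exact spelling:

```cpp
Aws::S3Crt::ClientConfiguration config;
// Assumed field name for the newly exposed CRT multipart_upload_threshold
// option; not yet available in a tagged release per the comments above.
config.multipartUploadThreshold = 5ULL * 1024ULL * 1024ULL * 1024ULL; // 5 GB
// Objects below the threshold are sent as a single request, but per the
// discussion above the whole body is still buffered in memory before sending.
```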
I noticed you also opened this issue here about using the non crt S3 client. We are still working on fixing that, but I wanted to ask if you had any other questions related to s3crt?
@jmklix I'm satisfied with the details @DmitriyMusatkin shared, and have decided to use the S3Client.
Although I had one more question on S3CrtClient: if I have a 5GB object and partSize is the default of 8MB, how many concurrent in-memory upload/download buffers (or chunks) will I have? I'm trying to determine the peak memory consumption of S3Crt.
CRT pools buffers and has an overall limit on memory consumption that is derived from the target throughput and is configurable via the memory-limit-in-bytes setting. For any target throughput < 25Gbps, the memory limit is set to 2GB, and it maxes out at 8GB for anything over 75Gbps.
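As a rough back-of-the-envelope estimate from those numbers (an approximation, not a guarantee from the CRT): with the default 8MB part size and the 2GB budget that applies below 25Gbps, on the order of 2GB / 8MB ≈ 256 part buffers can be resident at once, so peak buffer memory is bounded by the 2GB budget rather than by the 5GB object size.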
Describe the bug
I want to turn off S3CrtClient's default behaviour of doing multi-range GETs and multi-part PUTs, so I set Aws::S3Crt::ClientConfiguration.partSize to 5GB, so that a multi-part PUT only happens for objects greater than 5GB (5GB is chosen because that is the size limit of a single PUT call). Otherwise, I want only a single PUT/GET to happen for objects smaller than 5GB.
However, my application crashes. As per the stack trace, the client is automatically doing a ranged GET of 5GB in size and trying to allocate a buffer of that size, which fails an assertion because the buffer pool limit is only 2GB, resulting in SIGABRT.
Here's the stack trace:
Using lldb to print the meta_request->part_size:
The buffer pool limit being:
Expected Behavior
No crash should happen.
Current Behavior
Application crashes.
Reproduction Steps
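A minimal sketch of the setup that triggers the crash (bucket and key names are placeholders):

```cpp
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/GetObjectRequest.h>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::S3Crt::ClientConfiguration config;
        // 5 GB part size, intended to avoid splitting objects smaller than 5 GB
        config.partSize = 5ULL * 1024ULL * 1024ULL * 1024ULL;
        Aws::S3Crt::S3CrtClient client(config);

        Aws::S3Crt::Model::GetObjectRequest request;
        request.SetBucket("my-bucket"); // placeholder
        request.SetKey("my-object");    // placeholder

        // The client issues a ranged GET sized by partSize; allocating a 5 GB
        // buffer from the 2 GB buffer pool fails an assertion and SIGABRTs.
        auto outcome = client.GetObject(request);
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```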
Possible Solution
No response
Additional Information/Context
I also tried setting
config.downloadMemoryUsageWindow = 100 * 1024 * 1024;
but it has no effect and I still see the crash. What I am looking for is to not do ranged GETs for objects smaller than 5GB. Similarly, to not do multi-part PUTs for objects smaller than 5GB.
AWS CPP SDK version used
1.11.411
Compiler and Version used
Apple clang version 15.0.0 (clang-1500.3.9.4)
Operating System and version
MacOS 14.4.1