Manjunathagopi opened 5 months ago
CRT retries and timeouts work differently from the regular CPP SDK config options, and the CRT S3 client currently does not honor them. We have a backlog feature request to improve the situation: https://github.com/aws/aws-sdk-cpp/issues/2594.
As an alternative, you can configure lowSpeedLimit to tell the CRT to kill requests that are too slow.
Hi @DmitriyMusatkin, I tried setting lowSpeedLimit to 75 MB/s, which should kill requests that drop below 75 MB/s, and ran the test. Some requests were still taking around 1 second to read 25 MB from S3 (a rate of roughly 27 MB/s).
Aws::S3Crt::ClientConfiguration config;
config.lowSpeedLimit = 78643200;
Hi, any update on the above query?
The way the low speed limit is configured in the SDK is to kill the connection if throughput dips under the specified number for a given number of intervals (3 by default, otherwise derived from the request timeout): https://github.com/aws/aws-sdk-cpp/blob/main/generated/src/aws-cpp-sdk-s3-crt/source/S3CrtClient.cpp#L330.
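For illustration, here is a rough sketch, not the SDK's actual code, of how that could map onto the CRT's connection-monitoring settings; the struct and field names below are assumptions, and the exact interval derivation is my guess based on the description above:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical mirror of the CRT's connection-monitoring options
// (names are assumptions; see the linked S3CrtClient.cpp for the real code).
struct MonitoringOptionsSketch {
    uint64_t minimumThroughputBytesPerSecond = 0;
    uint32_t allowableThroughputFailureIntervalSeconds = 0;
};

// lowSpeedLimit becomes the minimum throughput; the number of one-second
// intervals defaults to 3 and is otherwise derived from the request timeout
// (assumption: rounded down to whole seconds, with a minimum of 1).
MonitoringOptionsSketch BuildMonitoringOptions(uint64_t lowSpeedLimitBytesPerSec,
                                               long requestTimeoutMs)
{
    MonitoringOptionsSketch opts;
    opts.minimumThroughputBytesPerSecond = lowSpeedLimitBytesPerSec;
    opts.allowableThroughputFailureIntervalSeconds =
        requestTimeoutMs > 0
            ? static_cast<uint32_t>(std::max(1L, requestTimeoutMs / 1000))
            : 3u;
    return opts;
}
```

In other words, with the defaults the connection is only torn down after throughput stays below lowSpeedLimit for roughly 3 consecutive seconds.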
Hard to tell exactly what's going on; we might need trace-level logs to debug it further. Note: the CRT already breaks up GET requests into part requests based on the configured part size, and your code does the same thing, so there might be some weird interaction between the two.
Hi @DmitriyMusatkin, attaching the trace logs for this. Can you please check?
[Uploading aws_sdk_2024-06-04-11.log…]()
Bad link? It just points to this issue.
Sorry @DmitriyMusatkin, here it is: aws_sdk_2024-06-04-11.log
The log you shared covers less than 2 seconds.
[DEBUG] 2024-06-04 11:36:31.624 task-scheduler [140710120072960] id=0x7ff98c003238: Scheduling gather_statistics task for future execution at time 1311437023221890
[DEBUG] 2024-06-04 11:36:31.965 task-scheduler [140709998868224] id=0x7ff988003d38: Scheduling gather_statistics task for future execution at time 1311437363810723
[DEBUG] 2024-06-04 11:36:31.966 task-scheduler [140709998868224] id=0x7ff988026158: Scheduling gather_statistics task for future execution at time 1311437364379715
...
[DEBUG] 2024-06-04 11:36:32.276 task-scheduler [140710120072960] id=0x7ff98c003238: Running gather_statistics task with <Canceled> status
[DEBUG] 2024-06-04 11:36:32.276 task-scheduler [140709998868224] id=0x7ff988003d38: Running gather_statistics task with <Canceled> status
[DEBUG] 2024-06-04 11:36:32.276 task-scheduler [140709998868224] id=0x7ff988026158: Running gather_statistics task with <Canceled> status
You can see the tasks that monitor the throughput were all canceled within 1 sec, even though they were scheduled to run after 1 sec (referring to here).
The reason those tasks were canceled is that the request had already completed.
[TRACE] 2024-06-04 11:36:32.275 S3MetaRequest [140710030337792] id=0x1837460 Meta request clean up finished.
As @DmitriyMusatkin said, the SDK will only kill the connection if it stays slow for a certain amount of time. For example, with the default of roughly 3 one-second monitoring intervals, a 25 MB GetObject that completes in about 1 second finishes before the throughput monitor ever has a chance to trigger.
Hi @TingDaoK @DmitriyMusatkin, attached below are the AWS trace logs for reading 25 MB from S3 8 times (200 MB total). Each 25 MB read is taking around 1 second.
config.throughputTargetGbps = 1.0;
config.partSize = 10485760;
config.httpRequestTimeoutMs = 100;
config.connectTimeoutMs = 100;
config.requestTimeoutMs = 100;
config.lowSpeedLimit = 78643200;
https://getshared.com/SvtM9D7w
Can you please explain why it is taking more time than the configured lowSpeedLimit and throughputTargetGbps would suggest, when the SDK will kill the connection, and what the exact threshold is for considering a connection slow?
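For reference, a minimal sketch of the timed read loop being described, using the configuration above; the bucket name, object key, and byte ranges are placeholders, not the actual values:

```cpp
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/GetObjectRequest.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <iostream>

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::S3Crt::ClientConfiguration config;
        config.region = "ap-south-1";   // assumption: same region as the bucket
        // ... plus the throughputTargetGbps / partSize / timeout / lowSpeedLimit
        //     values shown above
        Aws::S3Crt::S3CrtClient client(config);

        const uint64_t chunkSize = 25ULL * 1024 * 1024;  // 25 MB per GetObject
        for (int i = 0; i < 8; ++i)                      // 8 reads, ~200 MB total
        {
            Aws::S3Crt::Model::GetObjectRequest request;
            request.SetBucket("my-bucket");   // placeholder
            request.SetKey("my-object");      // placeholder

            // Request a 25 MB slice of the object via a Range header.
            char range[64];
            const uint64_t start = i * chunkSize;
            std::snprintf(range, sizeof(range), "bytes=%llu-%llu",
                          static_cast<unsigned long long>(start),
                          static_cast<unsigned long long>(start + chunkSize - 1));
            request.SetRange(range);

            // Time how long each GetObject call takes.
            const auto t0 = std::chrono::steady_clock::now();
            auto outcome = client.GetObject(request);
            const auto t1 = std::chrono::steady_clock::now();

            const auto ms =
                std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
            std::cout << "GetObject #" << i
                      << (outcome.IsSuccess() ? " ok, " : " failed, ")
                      << ms << " ms" << std::endl;
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```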
Hi, any update on the above query?
We reviewed the logs. A couple of things are coming into play here.
What seems to happen in your case is:
Hi @DmitriyMusatkin, thanks for the detailed analysis. I have a few questions regarding this.
config.throughputTargetGbps = 1.0;
config.partSize = 524288000;
config.lowSpeedLimit = 157286400;
Attaching the trace logs download link for issue 2: LINK. Can you please check this?
Hi, any update on the above query?
@DmitriyMusatkin: Could you please point me to the place in the code where partSize is used for parallel requests? So far I couldn't find where this value is used as you described, and I'm trying to understand how the parallelization works (e.g., by dynamically spawning more threads, by using async I/O on the socket, or ...). (Just by the way: same for throughputTargetGbps, where I also couldn't find a code path that seems to use this value, so I'm wondering what a good default would be if the line speed is unknown. Should it rather be too high or too low in case of doubt, or can it be set to a special value like "zero" to be ignored?)
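For context, my current reading, which may well be wrong, is that the C++ SDK mostly forwards these options into the CRT's client configuration when constructing the underlying CRT S3 client, and the actual parallelization (splitting a GET into ranged part requests and running them concurrently) happens inside the CRT rather than in the C++ SDK sources. Below is a purely hypothetical sketch of that hand-off; the struct and field names are assumptions, not the real CRT API:

```cpp
#include <aws/s3-crt/S3CrtClient.h>
#include <cstdint>

// Hypothetical stand-in for the CRT-side client configuration
// (field names are assumptions; the real struct lives in the CRT libraries).
struct CrtS3ClientConfigSketch {
    uint64_t part_size = 0;
    double throughput_target_gbps = 0.0;
};

// Sketch of the hand-off: the SDK options become CRT client settings, which
// the CRT then consults when splitting and scheduling part requests.
CrtS3ClientConfigSketch ToCrtConfig(const Aws::S3Crt::ClientConfiguration& config)
{
    CrtS3ClientConfigSketch crt;
    crt.part_size = config.partSize;                          // size of each ranged part request
    crt.throughput_target_gbps = config.throughputTargetGbps; // target the CRT aims for when
                                                              // deciding how much to run in parallel
    return crt;
}
```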
Describe the bug
Below are the S3 CRT client configs we set.
We are reading 25 MB at a time and measuring how long each GetObject takes: on average, each GetObject takes around 500 ms. Even though we set all the timeouts, the S3 CRT client is taking more time. We also tried with no retry strategy, but it made no difference and the behaviour is the same.
GetObject() is not exiting even when the timeout is reached. One time we saw GetObject() for a 25 MB read take 10 seconds. The behavior is basically the same whether or not we set timeouts or assign a custom retry strategy. We are running this on a c5a.4xlarge instance in the ap-south-1 region, and the S3 bucket is in the same region and account. Please tell us what is wrong here and suggest how to reduce the time taken for GetObject().
Expected Behavior
With zero retries, GetObject should exit once the timeout is reached.
Current Behavior
With zero retries, GetObject is not exiting even when the timeout is reached.
Reproduction Steps
zero retries
default strategy
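For illustration, a minimal sketch of those two variants, assuming the stock retry-strategy classes from aws-cpp-sdk-core; this is a simplified stand-in, not our exact code, and the 100 ms timeouts mirror the values shown earlier in the thread:

```cpp
#include <aws/core/client/DefaultRetryStrategy.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <memory>

// Variant 1: "zero retries" - an explicit retry strategy with 0 max retries.
Aws::S3Crt::ClientConfiguration MakeZeroRetryConfig()
{
    Aws::S3Crt::ClientConfiguration config;
    config.requestTimeoutMs = 100;   // values taken from the config shown earlier
    config.connectTimeoutMs = 100;
    config.retryStrategy =
        std::make_shared<Aws::Client::DefaultRetryStrategy>(/*maxRetries=*/0);
    return config;
}

// Variant 2: "default strategy" - retryStrategy left unset so the SDK's
// default retry behaviour applies.
Aws::S3Crt::ClientConfiguration MakeDefaultRetryConfig()
{
    Aws::S3Crt::ClientConfiguration config;
    config.requestTimeoutMs = 100;
    config.connectTimeoutMs = 100;
    return config;
}
```

Per the discussion above, the CRT-based client does not currently honor these timeout fields, which is why both variants behave the same in our tests.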
Possible Solution
No response
Additional Information/Context
No response
AWS CPP SDK version used
1.11.269
Compiler and Version used
gcc (GCC) 4.8.5
Operating System and version
CentOS Linux 7