pdumoulin opened this issue 2 months ago
Hi @pdumoulin ,
This does seem related to the issue you linked, as well as https://github.com/aws/aws-sdk-php/issues/29 and others. It appears to surface from an intermittent service-side issue.
In all of the other issues with the same symptoms, the problem went away after a few retries. Are you able to configure the retryer to retry the upload 5-10 times and see if it makes the problem go away?
I have already engaged the S3 team internally to try and get this diagnosed and fixed, but so far haven't heard back.
Let me know if the retry strategy works for you. Thanks, Ran~
Hi @RanVaknin,
I set the retry count to 20 and tried all three retry modes, legacy, adaptive, and standard, as described here. The batch was unable to complete in each case, encountering the same errors as above.
Because the average runtime did not increase when I raised the retry count, and because the aws-cli in the exact same environment does not encounter these issues, I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.
Please let me know if there is anything else I can do to help gather some more information about the issue and if there are any updates from the S3 team internally!
Hi @pdumoulin ,
Apologies for the late response. I got your internal ticket, thank you for that.
I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.
You can verify whether the retries are taking effect or not by enabling the request logs:
$connection = S3Client::factory([
    'debug' => true
]);
I will update your support person with next steps in order to engage S3.
Thanks, Ran~
Hi @RanVaknin ,
I set the debug level in the client factory as you suggested and was able to gather a lot more log information. I ran the same test and see aws-sdk-retry headers being sent. Their value is always 0/0, never anything else. Can you explain what this value means? It would appear retries are never attempted, but my retry config is set for a retry count of 3, so I would expect values like 0/3 or 1/3 if I am reading the header correctly.
Please let me know if there are any other data points I can extract from the logs to help.
Hi @pdumoulin ,
I'm not sure why the header is always 0/0. If I had to hypothesize, it's likely because of the difficulty of updating this value when code is being run concurrently. Instead of looking at the header value when inspecting the request logs, do you see the same request being sent multiple times? I.e., if this is an UploadPart API call that should be retried, there would be a partID associated with that particular call. In a retry scenario, you'll see multiple UploadPart calls with the same partID.
Can you share some of those logs with us? (Please redact the signature and principal ID from the logs.)
Thanks! Ran~
Hi @RanVaknin,
I've attached the logs to the internal AWS ticket that I filed. It does appear that one of the parts which ultimately failed was in fact retried 5 times, but my php client is configured to retry 3 times. I believe this may indicate that cURL is retrying but the php-sdk is not.
Please let me know if you find out anything and if you would like any more information.
Thank You!
Hi @pdumoulin ,
Thanks for providing these logs. From reviewing the error messages, I can see the part you are asking about:
upload-1 | * Connection died, retrying a fresh connect (retry count: 5)
upload-1 | * Closing connection 125
upload-1 | * Issue another request to this URL:REDACTED
upload-1 | * Found bundle for host: 0x555555f1f960 [serially]
upload-1 | * Re-using existing connection #127 with host REDACTED
upload-1 | > PUT /REDACTED
upload-1 | Host: REDACTED
upload-1 | Expect: 100-Continue
upload-1 | X-Amz-User-Agent: REDACTED cfg/retry-mode#legacy
upload-1 | Content-Type: audio/x-flac
upload-1 | aws-sdk-invocation-id: REDACTED
upload-1 | aws-sdk-retry: 0/0
upload-1 | x-amz-content-sha256: REDACTED
upload-1 | X-Amz-Date: REDACTED
upload-1 | X-Amz-Security-Token: REDACTED
upload-1 | Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20241003/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token;x-amz-user-agent, Signature=REDACTED
upload-1 | User-Agent: REDACTED cfg/retry-mode#legacy GuzzleHttp/7
upload-1 | Content-Length: REDACTED
upload-1 |
upload-1 | * Connection died, tried 5 times before giving up
upload-1 | * Closing connection 127
The retry count shown here comes from the underlying HTTP client (Guzzle) retrying the connection, not from the SDK. So even if you set the SDK to retry 3 times, the Guzzle HTTP client's default might be 5, which would explain the discrepancy.
Based on other reports of this same issue, the difficulty establishing a connection is not specific to the PHP SDK (not sure about the CLI; perhaps the multipart upload was not done concurrently?) but is reported across many SDKs and likely points to an issue with S3 itself. The workaround discussed was to configure a higher retry count; after 10-20 retry attempts the HTTP client was able to establish the connection.
One way we can test this workaround is to try to create a Guzzle middleware to increase the retry count. Check out this documentation.
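For illustration, a rough sketch of that approach, wiring a custom Guzzle client with a retry middleware into the S3 client via the http_handler option. The retry cap of 10, the region, and the choice to retry on any TransferException are assumptions for illustration, not values recommended in this thread:

use Aws\S3\S3Client;
use Aws\Handler\GuzzleV6\GuzzleHandler;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Exception\TransferException;

// Build a Guzzle handler stack with a retry middleware on top.
$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Retry up to 10 times whenever the transfer itself failed
    // (connection died, empty reply, etc.) rather than returning a response.
    function ($retries, $request, $response = null, $reason = null) {
        return $retries < 10 && $reason instanceof TransferException;
    }
));

// Hand the customized Guzzle client to the SDK.
$s3 = new S3Client([
    'region'       => 'us-east-1', // assumption
    'version'      => 'latest',
    'http_handler' => new GuzzleHandler(new GuzzleClient(['handler' => $stack])),
]);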
Another thing I can see from your user-agent is that the retry mode is set to legacy. I doubt that this will change anything, but can you try setting it to adaptive? https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html
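For reference, a minimal sketch of setting the retry mode and attempt count on the client; the max_attempts value, region, and version shown are illustrative assumptions:

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region'  => 'us-east-1', // assumption
    'version' => 'latest',
    // 'retries' accepts either an integer or an array with mode/max_attempts.
    'retries' => [
        'mode'         => 'adaptive',
        'max_attempts' => 10,
    ],
]);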
@stobrien89 any thoughts?
Thanks, Ran~
My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not taking into account an issue with the S3 service itself.
One way we can test this workaround is to try to create a Guzzle middleware to increase the retry count. Check out this documentation.
I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.
Another thing I can see from your user-agent is that the retry mode is set to legacy
I already ran the same tests with all retry modes, and the uploads failed in the same fashion.
Hi @pdumoulin ,
A few more clarifications after speaking to @stobrien89:
The 0/0 retry header is a red herring. The number on the left indicates the retry attempt, and the number on the right indicates the retry delay, which is 0 by default. So you should see 0/0, 1/0, 2/0, etc.
This issue with connection management is likely manifesting on the service side because of high concurrency. Are you able to try lowering the concurrency number to 20, or 10, or even removing it altogether? I understand that you want to use concurrent uploads, but we just want to observe the behavior when there is less concurrency involved (or none at all).
You can pass another verbose debug flag down to the cURL http handler like so:
$options = [
    'debug' => true,
    'http' => [
        'curl' => [
            CURLOPT_VERBOSE => true,
        ],
    ],
];
This should expose some networking level logs that might give us some more info.
As far as the solution / workaround goes, my previous suggestion still stands: raise the retry limit on the Guzzle client to try to force the connection to be re-established.
Thanks again, Ran~
Hi,
I missed your latest comment so responding separately here.
My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not taking into account an issue with the S3 service itself.
I don't think this is the case. A connection error is a networking-level event that is handled by the underlying HTTP client and not by the application layer. The SDK is meant to retry any retryable error that is tied to an HTTP status code, like a 5xx or a 429, or any other error that is defined as retryable. From your logs it seems like the failing UploadPart calls never completed a round trip, so there is nothing for the SDK to "hook into" in order to determine whether an error is retryable or not.
Admittedly, I'm basing my knowledge on the behavior of other SDKs I support. I'm more familiar with the Go and JS SDKs, which have some support for retrying networking-level errors. From the stack trace, it seems like there is an issue with establishing a connection rather than the typical I/O-bound retry scenario. I will review this again with the team to make sure I'm not misleading you here.
I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.
I agree with the sentiment here. Customers should not implement patches for the underlying http client. But there are two caveats here:
Thanks, Ran~
Hi @RanVaknin,
Thank you for your thoughtful and detailed responses. I appreciate the time and attention to detail.
Firstly, I planned and ran some more tests as per your suggestions; here is how they went...
Secondly, I also looked into how the aws-php-sdk handles networking-level errors and found that (if I'm reading the code correctly) it is set up to retry cURL errors for CURLE_RECV_ERROR, which is code 56 according to these docs. The error we are experiencing is CURLE_SEND_ERROR, which is code 55. It does appear that the SDK is capable of hooking into this error, but it is not configured to retry it. Based on the detailed error message, I think this could be caused by the server closing the connection, and it should be safe to retry on a brand new connection at the SDK level. Please let me know what you think.
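For illustration only, here is a rough sketch of how a caller could experiment with this today by wrapping the SDK's default legacy decider and also treating cURL error 55 as retryable. The extra middleware name, the retry cap, and the string match on the exception message are assumptions, not an officially supported pattern:

use Aws\Exception\AwsException;
use Aws\Middleware;
use Aws\RetryMiddleware;
use Aws\S3\S3Client;

$s3 = new S3Client([
    'region'  => 'us-east-1', // assumption
    'version' => 'latest',
]);

$defaultDecider = RetryMiddleware::createDefaultDecider(3);

// Fall back to the default decider, but also retry when the underlying
// failure is "cURL error 55" (CURLE_SEND_ERROR).
$decider = function ($retries, $command, $request = null, $result = null, $error = null) use ($defaultDecider) {
    if ($defaultDecider($retries, $command, $request, $result, $error)) {
        return true;
    }
    return $retries < 3
        && $error instanceof AwsException
        && strpos($error->getMessage(), 'cURL error 55') !== false;
};

// Append an extra retry middleware to the client's handler list.
$s3->getHandlerList()->appendSign(Middleware::retry($decider), 'extra-retry');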
Hi @pdumoulin,
Sorry for the issues and thanks for your patience thus far. I suspect this issue is coming from the S3 side; we've had reports of slowdowns and general connection issues since they've enabled TLS 1.3. I'd be curious to see your results after forcing TLS 1.2 in the S3 client by adding the following to your top-level configuration:
'http' => [
    'curl' => [
        CURLOPT_SSLVERSION => CURL_SSLVERSION_MAX_TLSv1_2,
    ],
],
@stobrien89 - I gave your suggestion a try and received the same error: AWS HTTP error: cURL error 55: Connection died, tried 5 times before giving up. Any thoughts about adding cURL error code 55 as retryable, as mentioned previously?
@pdumoulin,
I was able to reproduce the issue and it appears to be a network bottleneck caused by a combination of the number of concurrent requests and the number of multipart workflows (for instance, a 50 mb file will require 12 separate requests). I don't think retrying cURL send errors will alleviate this. I'll need to dig deeper on the specific limiting factors, but in the meantime, you could try tweaking the mup_threshold and concurrency options until you find a sweet spot. For reference, I was able to complete the uploads successfully by increasing the mup_threshold to 50mb (but you can try lower) and dropping the concurrency to 20 (but you can try higher).
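For reference, a minimal sketch of passing those options to the transfer helper; the bucket name, paths, region, and exact values are placeholders taken from the suggestion above:

use Aws\S3\S3Client;
use Aws\S3\Transfer;

$s3 = new S3Client([
    'region'  => 'us-east-1', // assumption
    'version' => 'latest',
]);

// Upload a local directory, using multipart only for objects >= 50 MB
// and limiting concurrent requests to 20.
$transfer = new Transfer($s3, '/path/to/files', 's3://my-bucket/prefix', [
    'mup_threshold' => 50 * 1024 * 1024,
    'concurrency'   => 20,
]);
$transfer->transfer();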
Firstly, I tried tweaking settings to reduce network requests, all the way down to...
concurrency => 5
mup_threshold => 50mb
part_size => 50mb
During every test, I encountered the same error as before.
Secondly, I am testing with 73 files, only 3 of which are larger than 50 mb. I tried setting mup_threshold to 100mb, essentially turning off multi-part uploads, and it worked. I am going to run some more tests to see if I can find a "sweet spot", but for our use case we have to be able to support much larger files, and setting mup_threshold much higher is probably not an option.
@stobrien89 @RanVaknin
We ran a test on real data with...
mup_threshold => 25mb
part_size => 10mb
All other settings as default.
The data that was giving us consistent problems in the past went through without an issue. It's a small data set, but encouraging.
I don't think we can consider this completely resolved, but I hope changing these settings will at least stabilize the behaviour for our use case.
Thank you for the help, and please let me know if there are any further changes in the client or server side regarding this issue.
@stobrien89 - although our test case is stable, our production system is still experiencing these errors. We will probably try tweaking the settings some more, but I wanted to check in on a couple of things...
Describe the bug
Using the aws-php-sdk transfer function to upload a directory of files always fails at some point during the process.
Expected Behavior
I would expect these operations to be retried by the SDK, or for this sort of error to be very rare.
Current Behavior
Uploading a batch of 73 files, ranging in size from 2MB to 58MB, consistently fails sometime in the middle of the batch with one of the following errors...
AWS HTTP error: cURL error 55: Connection died, tried 5 times before giving up
AWS HTTP error: cURL error 52: Empty reply from server
RequestTimeout (client): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
Reproduction Steps
The following PHP script runs in a debian12 Docker container using PHP 8.2.
Possible Solution
No response
Additional Information/Context
FLAC audio files
SDK version used
3.321.0
Environment details (Version of PHP (php -v)? OS name and version, etc.)
php 8.2.22 ; OpenSSL 3.0.13 ; curl 7.88.1 ; debian12 (docker container)