aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0

transfer upload batch of files always fails #2992

Open pdumoulin opened 2 months ago

pdumoulin commented 2 months ago

Describe the bug

Using the aws-php-sdk transfer function to upload a directory of files always fails at some point during the process.

Expected Behavior

I would expect these operations to be retried by the SDK, or for this sort of error to be very rare.

Current Behavior

Uploading a batch of 73 files, ranging in size from 2MB to 58MB, consistently fails partway through the batch with one of the following errors...

Reproduction Steps

The following PHP script runs in a Debian 12 Docker container using PHP 8.2.

<?php

require __DIR__ . '/vendor/autoload.php';

use Aws\S3\S3Client;

$bucket = 'redacted';

# setup client
$connection = S3Client::factory();

# transfer options: 40 concurrent requests, multipart uploads above 16 MB
$options = array(
    'debug' => true,
    'concurrency' => 40,
    'mup_threshold' => 16777216
);
# recursively upload the local directory to the bucket prefix
$manager = new \Aws\S3\Transfer(
    $connection,
    "/tmp/input/",
    "s3://$bucket/php-upload-test",
    $options
);
$manager->transfer();

Possible Solution

No response

Additional Information/Context

SDK version used

3.321.0

Environment details (Version of PHP (php -v)? OS name and version, etc.)

PHP 8.2.22; OpenSSL 3.0.13; cURL 7.88.1; Debian 12 (Docker container)

RanVaknin commented 2 months ago

Hi @pdumoulin ,

This does seem related to the issue you linked, and also to https://github.com/aws/aws-sdk-php/issues/29 and others. It appears to surface from an intermittent service-side issue.

In all of the other issues with the same symptoms, the problem went away after a few retries. Are you able to configure the retryer to retry the upload 5-10 times and see if it makes the problem go away?
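For reference, this is roughly what I mean; a minimal, untested sketch that only shows the retry config (region/credentials are assumed to come from the environment, as in your repro script). Passing a plain integer like 'retries' => 10 also works with the legacy retryer:

$connection = S3Client::factory([
    'retries' => [
        'mode'         => 'standard',   // or 'legacy' / 'adaptive'
        'max_attempts' => 10,           // try something higher than the default of 3
    ],
]);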

I have already engaged the S3 team internally to try and get this diagnosed and fixed, but so far haven't heard back.

Let me know if the retry strategy works for you. Thanks, Ran~

pdumoulin commented 2 months ago

Hi @RanVaknin,

I set the retry count to 20 and tried all three retry modes (legacy, adaptive, and standard) as described here. The batch could not be completed in any case; I encountered the same errors as above.

Because the average runtime did not increase when the retry count was raised, and the aws-cli in the exact same environment does not encounter these issues, I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.

Please let me know if there is anything else I can do to help gather some more information about the issue and if there are any updates from the S3 team internally!

RanVaknin commented 1 month ago

Hi @pdumoulin ,

Apologies for the late response. I received your internal ticket, thank you for that.

I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.

You can verify whether the retries are taking effect or not by enabling the request logs:

$connection = S3Client::factory([
    'debug' => true
]);

I will update your support person with next steps in order to engage S3.

Thanks, Ran~

pdumoulin commented 1 month ago

Hi @RanVaknin ,

I set the debug level in the client factory as you suggested and was able to gather a lot more log information. I ran the same test and can see aws-sdk-retry headers being sent, but their value is always 0/0, never anything else. Can you explain what this value means? It would appear retries are never attempted, yet my retry config is set for a retry count of 3, so I would expect values like 0/3 or 1/3 if I am reading the header correctly.

Please let me know if there are any other data points I can extract from the logs to help.

RanVaknin commented 4 weeks ago

Hi @pdumoulin ,

I'm not sure why the header is always 0/0. If I had to hypothesize, it's likely because of the challenge of updating this value when code is being run concurrently. Instead of looking at the header value when inspecting the request logs, do you see the same request being sent multiple times? I.e., if this is an UploadPart API call that should be retried, there would be a partID associated with that particular call. In a retry scenario, you'll see multiple UploadPart calls with the same partID.

Can you share some of those logs with us? (please redact the signature and principal ID from the logs).

Thanks! Ran~

pdumoulin commented 3 weeks ago

Hi @RanVaknin,

I've attached the logs to the internal AWS ticket that I filed. It does appear that one of the parts which ultimately failed was in fact retried 5 times, but my php client is configured to retry 3 times. I believe this may indicate that cURL is retrying but the php-sdk is not.

Please let me know if you find out anything and if you would like any more information.

Thank You!

RanVaknin commented 3 weeks ago

Hi @pdumoulin ,

Thanks for providing these logs. From reviewing the error messages, I can see the part you are asking about:

upload-1  | * Connection died, retrying a fresh connect (retry count: 5)
upload-1  | * Closing connection 125
upload-1  | * Issue another request to this URL:REDACTED
upload-1  | * Found bundle for host: 0x555555f1f960 [serially]
upload-1  | * Re-using existing connection #127 with host REDACTED
upload-1  | > PUT /REDACTED
upload-1  | Host: REDACTED
upload-1  | Expect: 100-Continue
upload-1  | X-Amz-User-Agent: REDACTED cfg/retry-mode#legacy
upload-1  | Content-Type: audio/x-flac
upload-1  | aws-sdk-invocation-id: REDACTED
upload-1  | aws-sdk-retry: 0/0
upload-1  | x-amz-content-sha256: REDACTED
upload-1  | X-Amz-Date: REDACTED
upload-1  | X-Amz-Security-Token: REDACTED
upload-1  | Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20241003/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token;x-amz-user-agent, Signature=REDACTED
upload-1  | User-Agent: REDACTED cfg/retry-mode#legacy GuzzleHttp/7
upload-1  | Content-Length: REDACTED
upload-1  | 
upload-1  | * Connection died, tried 5 times before giving up
upload-1  | * Closing connection 127

The retry count shown here comes from the underlying HTTP client (Guzzle) retrying the connection, not from the SDK. So even if you set the SDK to retry 3 times, the Guzzle HTTP client's default might be 5, which would explain the discrepancy.

Based on other reports of that same issue, this difficulty establishing a connection is not specific to the PHP SDK (not sure about the CLI; perhaps the multipart upload was not done concurrently?) but is reported across many SDKs and likely points to an issue with S3 itself. The workaround discussed was to configure a higher retry count; after 10-20 retry attempts the HTTP client was able to establish the connection.

One way we can test this workaround is to create a Guzzle middleware that increases the retry count. Check out this documentation.
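Roughly, the idea would look something like the sketch below (untested; the attempt limit and backoff are placeholders). It builds a Guzzle client with a retry middleware on its handler stack and hands it to the SDK through the http_handler option:

use Aws\Handler\GuzzleV6\GuzzleHandler;
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry when the request never got a response back
    // (e.g. "Connection died" / cURL error 55), up to 10 attempts.
    function ($retries, $request, $response = null, $exception = null) {
        return $retries < 10 && $response === null && $exception !== null;
    },
    // Delay: simple linear backoff, in milliseconds.
    function ($retries) {
        return 500 * $retries;
    }
));

$connection = S3Client::factory([
    'http_handler' => new GuzzleHandler(new Client(['handler' => $stack])),
]);

One caveat: this sketch does not rewind the request body between attempts, so for large PUT bodies the streams may need to be rewindable for a re-send to actually work.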

Another thing I can see from your user-agent is that the retry mode is set to legacy. I doubt that this will change anything, but can you try to set it to adaptive? https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html

@stobrien89 any thoughts?

Thanks, Ran~

pdumoulin commented 3 weeks ago

My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not accounting for an issue with the S3 service itself.

One way we can test this workaround is to try to create a Guzzle middleware to increase the retry count. Check out this documentation.

I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.

Another thing I can see from your user-agent is that the retry mode is set to legacy

I already ran the same tests with all retry modes, and the uploads failed in the same fashion.

RanVaknin commented 3 weeks ago

Hi @pdumoulin ,

A few more clarifications after speaking to @stobrien89:

$options = array(
    'debug' => true,
    'http' => [
        'curl' => [
            CURLOPT_VERBOSE => true,
        ],
    ],
);

This should expose some networking level logs that might give us some more info.

As far as the solution / workaround, my previous statement still stands. We need to raise the retry limit on the Guzzle client to try and force the server to re-establish a connection, so that suggestion remains the same.

Thanks again, Ran~

RanVaknin commented 3 weeks ago

Hi,

I missed your latest comment so responding separately here.

My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not accounting for an issue with the S3 service itself.

I don't think this is the case. A connection error is a networking-level event that is handled by the underlying HTTP client, not by the application layer. The SDK is meant to retry any retryable error that is tied to an HTTP status code, like a 5xx or a 429, or any other error that is defined as retryable. From your logs it seems the failing UploadPart calls never round-tripped, therefore there is nothing for the SDK to "hook into" in order to determine whether an error is retryable or not.

Admittedly, I'm basing my knowledge on the behavior of other SDKs I support. I'm more familiar with the Go and JS SDKs, which have some support for retrying networking-level errors. From the stack trace, it seems like there is an issue with establishing a connection rather than the typical I/O-bound retry scenario. I will review this again with the team to make sure I'm not misleading you here.

I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.

I agree with the sentiment here. Customers should not have to implement patches for the underlying HTTP client. But there are two caveats here:

  1. Networking-level events are not visible to the application layer. In other words, the PHP SDK cannot retry this error because it happens in the networking layer and is not I/O bound.
  2. The reason I ask you to increase the retry count on the Guzzle client is not to fix your problem, but rather to prove that the issue you are having is related to https://github.com/aws/aws-sdk-go-v2/issues/2657. If that is the case, the fix would have to come from the server side with S3, not from the SDK team, as it pertains to how they handle concurrent connection pooling on the service side.

Thanks, Ran~

pdumoulin commented 3 weeks ago

Hi @RanVaknin,

Thank you for your thoughtful and detailed responses. I appreciate the time and attention to detail.

Firstly, I ran (or planned) some more tests per your suggestions; here is how they went...

  1. decreased the concurrency to 5 (we are using the SDK default in our production workloads) => the failures still happened in the same way
  2. added the cURL debugging option => PHP memory usage skyrockets; even after raising the limit to 20GB, memory is exhausted in under a minute, which is not enough time to reproduce the bug consistently
  3. custom Guzzle middleware => I am going to need more time to set this up and run tests

Secondly, I also looked into how the aws-php-sdk handles networking-level errors and found that (if I'm reading the code correctly) it is set up to retry cURL errors for CURLE_RECV_ERROR, which is code 56 according to these docs. The error we are experiencing is CURLE_SEND_ERROR, which is code 55. It does appear that the SDK is capable of hooking into this error, but it is not configured to retry it. Based on the detailed error message, I think this could be caused by the server closing the connection, and it should be safe to retry on a brand new connection at the SDK level. Please let me know what you think.
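To illustrate what I have in mind, here is a rough, untested sketch of a retry decider that keys on the raw cURL errno (which Guzzle exposes through the exception's handler context) and treats error 55 the same way error 56 already is. It could be dropped into a Guzzle Middleware::retry() setup like the one suggested earlier; the attempt limit of 10 is a placeholder:

$decider = function ($retries, $request, $response = null, $exception = null) {
    if ($retries >= 10 || $exception === null) {
        return false;
    }
    // Guzzle's curl handler records the raw errno in the handler context
    // of both RequestException and ConnectException.
    $ctx = method_exists($exception, 'getHandlerContext')
        ? $exception->getHandlerContext()
        : [];
    return in_array($ctx['errno'] ?? null, [CURLE_SEND_ERROR, CURLE_RECV_ERROR], true);
};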

stobrien89 commented 2 weeks ago

Hi @pdumoulin,

Sorry for the issues and thanks for your patience thus far. I suspect this issue is coming from the S3 side: we've had reports of slowdowns and general connection issues since they enabled TLS 1.3. I'd be curious to see your results after forcing TLS 1.2 in the S3 client by adding the following to your top-level configuration:

'http' => [
    'curl' => [
        CURLOPT_SSLVERSION => CURL_SSLVERSION_MAX_TLSv1_2,
    ],
],

pdumoulin commented 2 weeks ago

@stobrien89 - I gave your suggestion a try and received the same error: AWS HTTP error: cURL error 55: Connection died, tried 5 times before giving up. Any thoughts on adding cURL error code 55 as retryable, as mentioned previously?

stobrien89 commented 2 weeks ago

@pdumoulin,

I was able to reproduce the issue, and it appears to be a network bottleneck caused by a combination of the number of concurrent requests and the number of multipart workflows (for instance, a 50 MB file will require 12 separate requests). I don't think retrying cURL send errors will alleviate this. I'll need to dig deeper into the specific limiting factors, but in the meantime you could try tweaking the mup_threshold and concurrency options until you find a sweet spot. For reference, I was able to complete the uploads successfully by increasing the mup_threshold to 50 MB (but you can try lower) and dropping the concurrency to 20 (but you can try higher).
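In Transfer terms, that tuning would look roughly like this (untested; reusing the bucket/path placeholders from your repro script):

$options = array(
    'concurrency'   => 20,                // fewer simultaneous requests
    'mup_threshold' => 50 * 1024 * 1024,  // multipart only for files around 50 MB and up
);
$manager = new \Aws\S3\Transfer(
    $connection,
    "/tmp/input/",
    "s3://$bucket/php-upload-test",
    $options
);
$manager->transfer();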

pdumoulin commented 1 week ago

Firstly, I tried tweaking settings to reduce network requests, all the way down to...

During every test, I encountered the same error as before.

Secondly, I am testing with 73 files, only 3 of which are larger than 50 MB. I tried setting mup_threshold to 100 MB, essentially turning off multipart uploads, and it worked. I am going to run some more tests to see if I can find a "sweet spot", but for our use case we have to be able to support much larger files, so setting mup_threshold much higher is probably not an option.

pdumoulin commented 1 week ago

@stobrien89 @RanVaknin

We ran a test on real data with...

All other settings as default.

The data that was giving us consistent problems in the past went through without an issue. It's a small data set, but encouraging.

I don't think we can consider this completely resolved, but I hope changing these settings will at least stabilize the behaviour for our use case.

Thank you for the help, and please let me know if there are any further changes in the client or server side regarding this issue.

pdumoulin commented 3 days ago

@stobrien89 - although our test case is stable, our production system is still experiencing these errors

we will probably try tweaking the settings some more, but I wanted to check in on a couple of things...