aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0

transfer upload batch of files always fails #2992

Open pdumoulin opened 2 months ago

pdumoulin commented 2 months ago

Describe the bug

Using the aws-php-sdk transfer function to upload a directory of files always fails at some point during the process.

Expected Behavior

I would expect these operations to be retried by the SDK, or for this sort of error to be very rare.

Current Behavior

Uploading a batch of 73 files, ranging in size from 2MB to 58MB, consistently fails partway through the batch with one of the following errors...

Reproduction Steps

The following PHP script runs in a Debian 12 Docker container using PHP 8.2.

<?php

require __DIR__ . '/vendor/autoload.php';

use Aws\S3\S3Client;

$bucket = 'redacted';

# setup client
$connection = S3Client::factory();

# transfer options: 40 concurrent requests, multipart uploads above 16 MB
$options = array(
    'debug' => true,
    'concurrency' => 40,
    'mup_threshold' => 16777216
);
# recursively upload the local directory to the bucket prefix
$manager = new \Aws\S3\Transfer(
    $connection,
    "/tmp/input/",
    "s3://$bucket/php-upload-test",
    $options
);
$manager->transfer();

Possible Solution

No response

Additional Information/Context

SDK version used

3.321.0

Environment details (Version of PHP (php -v)? OS name and version, etc.)

PHP 8.2.22; OpenSSL 3.0.13; cURL 7.88.1; Debian 12 (Docker container)

RanVaknin commented 2 months ago

Hi @pdumoulin ,

This does seem related to the issue you linked, and also to https://github.com/aws/aws-sdk-php/issues/29 and others. It appears to surface from an intermittent service-side issue.

In all of the other issues with the same symptoms, the problem went away after a few retries. Are you able to configure the retryer to retry the upload 5-10 times and see if it makes the problem go away?
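For reference, this is roughly what I mean; a minimal, untested sketch that only shows the retry config (region/credentials are assumed to come from the environment, as in your repro script). Passing a plain integer like 'retries' => 10 also works with the legacy retryer:

$connection = S3Client::factory([
    'retries' => [
        'mode'         => 'standard',   // or 'legacy' / 'adaptive'
        'max_attempts' => 10,           // try something higher than the default of 3
    ],
]);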

I have already engaged the S3 team internally to try and get this diagnosed and fixed, but so far haven't heard back.

Let me know if the retry strategy works for you. Thanks, Ran~

pdumoulin commented 2 months ago

Hi @RanVaknin,

I set the retry count to 20 and tried all three retry modes (legacy, adaptive, and standard) as described here. The batch could not be completed in any case; I encountered the same errors as above.

Because the average runtime did not increase when the retry count was raised, and the aws-cli in the exact same environment does not encounter these issues, I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.

Please let me know if there is anything else I can do to help gather some more information about the issue and if there are any updates from the S3 team internally!

RanVaknin commented 1 month ago

Hi @pdumoulin ,

Apologies for the late response. I received your internal ticket, thank you for that.

I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases.

You can verify whether the retries are taking effect or not by enabling the request logs:

$connection = S3Client::factory([
    'debug' => true
]);

I will update your support person with next steps in order to engage S3.

Thanks, Ran~

pdumoulin commented 1 month ago

Hi @RanVaknin ,

I set the debug level in the client factory as you suggested and was able to gather a lot more log information. I ran the same test and can see aws-sdk-retry headers being sent, but their value is always 0/0, never anything else. Can you explain what this value means? It would appear retries are never attempted, yet my retry config is set for a retry count of 3, so I would expect values like 0/3 or 1/3 if I am reading the header correctly.

Please let me know if there are any other data points I can extract from the logs to help.

RanVaknin commented 4 weeks ago

Hi @pdumoulin ,

I'm not sure why the header is always 0/0. If I had to hypothesize, it's likely because of the challenge of updating this value when code is being run concurrently. Instead of looking at the header value when inspecting the request logs, do you see the same request being sent multiple times? I.e., if this is an UploadPart API call that should be retried, there would be a partID associated with that particular call. In a retry scenario, you'll see multiple UploadPart calls with the same partID.

Can you share some of those logs with us? (please redact the signature and principal ID from the logs).

Thanks! Ran~

pdumoulin commented 3 weeks ago

Hi @RanVaknin,

I've attached the logs to the internal AWS ticket that I filed. It does appear that one of the parts which ultimately failed was in fact retried 5 times, but my php client is configured to retry 3 times. I believe this may indicate that cURL is retrying but the php-sdk is not.

Please let me know if you find out anything and if you would like any more information.

Thank You!

RanVaknin commented 3 weeks ago

Hi @pdumoulin ,

Thanks for providing these logs. From reviewing the error messages, I can see the part you are asking about:

upload-1  | * Connection died, retrying a fresh connect (retry count: 5)
upload-1  | * Closing connection 125
upload-1  | * Issue another request to this URL:REDACTED
upload-1  | * Found bundle for host: 0x555555f1f960 [serially]
upload-1  | * Re-using existing connection #127 with host REDACTED
upload-1  | > PUT /REDACTED
upload-1  | Host: REDACTED
upload-1  | Expect: 100-Continue
upload-1  | X-Amz-User-Agent: REDACTED cfg/retry-mode#legacy
upload-1  | Content-Type: audio/x-flac
upload-1  | aws-sdk-invocation-id: REDACTED
upload-1  | aws-sdk-retry: 0/0
upload-1  | x-amz-content-sha256: REDACTED
upload-1  | X-Amz-Date: REDACTED
upload-1  | X-Amz-Security-Token: REDACTED
upload-1  | Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20241003/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token;x-amz-user-agent, Signature=REDACTED
upload-1  | User-Agent: REDACTED cfg/retry-mode#legacy GuzzleHttp/7
upload-1  | Content-Length: REDACTED
upload-1  | 
upload-1  | * Connection died, tried 5 times before giving up
upload-1  | * Closing connection 127

The retry count shown here comes from the underlying HTTP client (Guzzle) retrying the connection, not from the SDK. So even if you set the SDK to retry 3 times, the Guzzle HTTP client's default might be 5, which would explain the discrepancy.

Based on other reports of that same issue, this difficulty establishing a connection is not specific to the PHP SDK (not sure about the CLI; perhaps the multipart upload was not done concurrently?) but is reported across many SDKs and likely points to an issue with S3 itself. The workaround discussed was to configure a higher retry count; after 10-20 retry attempts the HTTP client was able to establish the connection.

One way we can test this workaround is to create a Guzzle middleware that increases the retry count. Check out this documentation.
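Roughly, the idea would look something like the sketch below (untested; the attempt limit and backoff are placeholders). It builds a Guzzle client with a retry middleware on its handler stack and hands it to the SDK through the http_handler option:

use Aws\Handler\GuzzleV6\GuzzleHandler;
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry when the request never got a response back
    // (e.g. "Connection died" / cURL error 55), up to 10 attempts.
    function ($retries, $request, $response = null, $exception = null) {
        return $retries < 10 && $response === null && $exception !== null;
    },
    // Delay: simple linear backoff, in milliseconds.
    function ($retries) {
        return 500 * $retries;
    }
));

$connection = S3Client::factory([
    'http_handler' => new GuzzleHandler(new Client(['handler' => $stack])),
]);

One caveat: this sketch does not rewind the request body between attempts, so for large PUT bodies the streams may need to be rewindable for a re-send to actually work.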

Another thing I can see from your user-agent is that the retry mode is set to legacy. I doubt that this will change anything, but can you try to set it to adaptive? https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html

@stobrien89 any thoughts?

Thanks, Ran~

pdumoulin commented 3 weeks ago

My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not accounting for an issue with the S3 service itself.

One way we can test this workaround is to try to create a Guzzle middleware to increase the retry count. Check out this documentation.

I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.

Another thing I can see from your user-agent is that the retry mode is set to legacy

I already ran the same tests with all retry modes, and the uploads failed in the same fashion.

RanVaknin commented 3 weeks ago

Hi @pdumoulin ,

A few more clarifications after speaking to @stobrien89:

$options = array(
    'debug' => true,
    'http' => [
        'curl' => [
            CURLOPT_VERBOSE => true,
        ],
    ],
);

This should expose some networking level logs that might give us some more info.

As far as the solution / workaround, my previous statement still stands. We need to raise the retry limit on the Guzzle client to try and force the server to re-establish a connection, so that suggestion remains the same.

Thanks again, Ran~

RanVaknin commented 3 weeks ago

Hi,

I missed your latest comment so responding separately here.

My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not accounting for an issue with the S3 service itself.

I don't think this is the case. A connection error is a networking-level event that is handled by the underlying HTTP client, not by the application layer. The SDK is meant to retry any retryable error that is tied to an HTTP status code, like a 5xx or a 429, or any other error that is defined as retryable. From your logs it seems the failing UploadPart calls never round-tripped, therefore there is nothing for the SDK to "hook into" in order to determine whether an error is retryable or not.

Admittedly, I'm basing my knowledge on the behavior of other SDKs I support. I'm more familiar with the Go and JS SDKs, which have some support for retrying networking-level errors. From the stack trace, it seems like there is an issue with establishing a connection rather than the typical I/O-bound retry scenario. I will review this again with the team to make sure I'm not misleading you here.

I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.

I agree with the sentiment here. Customers should not have to implement patches for the underlying HTTP client. But there are two caveats here:

  1. Networking-level events are not visible to the application layer. In other words, the PHP SDK cannot retry this error because it happens in the networking layer and is not I/O bound.
  2. The reason I ask you to increase the retry count on the Guzzle client is not to fix your problem, but rather to prove that the issue you are having is related to https://github.com/aws/aws-sdk-go-v2/issues/2657. If that is the case, the fix would have to come from the server side with S3, not from the SDK team, as it pertains to how they handle concurrent connection pooling on the service side.

Thanks, Ran~

pdumoulin commented 3 weeks ago

Hi @RanVaknin,

Thank you for your thoughtful and detailed responses. I appreciate the time and attention to detail.

Firstly, I ran (or planned) some more tests per your suggestions; here is how they went...

  1. decreased the concurrency to 5 (we are using the SDK default in our production workloads) => the failures still happened in the same way
  2. added the cURL debugging option => PHP memory usage skyrockets; even after raising the limit to 20GB, memory is exhausted in under a minute, which is not enough time to reproduce the bug consistently
  3. custom Guzzle middleware => I am going to need more time to set this up and run tests

Secondly, I also looked into how the aws-php-sdk handles networking-level errors and found that (if I'm reading the code correctly) it is set up to retry cURL errors for CURLE_RECV_ERROR, which is code 56 according to these docs. The error we are experiencing is CURLE_SEND_ERROR, which is code 55. It does appear that the SDK is capable of hooking into this error, but it is not configured to retry it. Based on the detailed error message, I think this could be caused by the server closing the connection, and it should be safe to retry on a brand new connection at the SDK level. Please let me know what you think.
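To illustrate what I have in mind, here is a rough, untested sketch of a retry decider that keys on the raw cURL errno (which Guzzle exposes through the exception's handler context) and treats error 55 the same way error 56 already is. It could be dropped into a Guzzle Middleware::retry() setup like the one suggested earlier; the attempt limit of 10 is a placeholder:

$decider = function ($retries, $request, $response = null, $exception = null) {
    if ($retries >= 10 || $exception === null) {
        return false;
    }
    // Guzzle's curl handler records the raw errno in the handler context
    // of both RequestException and ConnectException.
    $ctx = method_exists($exception, 'getHandlerContext')
        ? $exception->getHandlerContext()
        : [];
    return in_array($ctx['errno'] ?? null, [CURLE_SEND_ERROR, CURLE_RECV_ERROR], true);
};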

stobrien89 commented 2 weeks ago

Hi @pdumoulin,

Sorry for the issues and thanks for your patience thus far. I suspect this issue is coming from the S3 side: we've had reports of slowdowns and general connection issues since they enabled TLS 1.3. I'd be curious to see your results after forcing TLS 1.2 in the S3 client by adding the following to your top-level configuration:

'http' => [
    'curl' => [
        CURLOPT_SSLVERSION => CURL_SSLVERSION_MAX_TLSv1_2,
    ],
],

pdumoulin commented 2 weeks ago

@stobrien89 - I gave your suggestion a try and received the same error: AWS HTTP error: cURL error 55: Connection died, tried 5 times before giving up. Any thoughts on adding cURL error code 55 as retryable, as mentioned previously?

stobrien89 commented 2 weeks ago

@pdumoulin,

I was able to reproduce the issue, and it appears to be a network bottleneck caused by a combination of the number of concurrent requests and the number of multipart workflows (for instance, a 50 MB file will require 12 separate requests). I don't think retrying cURL send errors will alleviate this. I'll need to dig deeper into the specific limiting factors, but in the meantime you could try tweaking the mup_threshold and concurrency options until you find a sweet spot. For reference, I was able to complete the uploads successfully by increasing the mup_threshold to 50 MB (but you can try lower) and dropping the concurrency to 20 (but you can try higher).
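In Transfer terms, that tuning would look roughly like this (untested; reusing the bucket/path placeholders from your repro script):

$options = array(
    'concurrency'   => 20,                // fewer simultaneous requests
    'mup_threshold' => 50 * 1024 * 1024,  // multipart only for files around 50 MB and up
);
$manager = new \Aws\S3\Transfer(
    $connection,
    "/tmp/input/",
    "s3://$bucket/php-upload-test",
    $options
);
$manager->transfer();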

pdumoulin commented 1 week ago

Firstly, I tried tweaking settings to reduce network requests, all the way down to...

During every test, I encountered the same error as before.

Secondly, I am testing with 73 files, only 3 of which are larger than 50 MB. I tried setting mup_threshold to 100 MB, essentially turning off multipart uploads, and it worked. I am going to run some more tests to see if I can find a "sweet spot", but for our use case we have to be able to support much larger files, so setting mup_threshold much higher is probably not an option.

pdumoulin commented 1 week ago

@stobrien89 @RanVaknin

We ran a test on real data with...

All other settings as default.

The data that was giving us consistent problems in the past went through without an issue. It's a small data set, but encouraging.

I don't think we can consider this completely resolved, but I hope changing these settings will at least stabilize the behaviour for our use case.

Thank you for the help, and please let me know if there are any further changes in the client or server side regarding this issue.

pdumoulin commented 3 days ago

@stobrien89 - although our test case is stable, our production system is still experiencing these errors

we will probably try tweaking the settings some more, but I wanted to check in on a couple of things...