aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0

MultiPartCopy of large files (50-500GB) results in corrupted files when parts fail #2055

Closed somnaderi closed 2 years ago

somnaderi commented 4 years ago


Describe the bug: When doing a MultipartCopy of large files (50 GB+), failing parts can result in corrupted files at the destination. The files at the destination will be significantly larger than the original file (sometimes close to double the size).

While I've seen several transfers succeed despite failures, that does not always appear to be the case. From my perspective, I'm informed of a large file being corrupted by a downstream service, and in every one of those cases the logs show that the MultipartCopy had some kind of error it had to recover from. That said, there are other instances of error recovery where the output was fine/not corrupted.

A couple of the error types that have triggered this are:

2020-07-02T02:42:45.056396-07:00: event=multipartUploadFailed level=warning attempts=0 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:

2020-07-02T07:30:04.365831-07:00: event=multipartUploadFailed level=warning attempts=0 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:

The code I'm using is similar to the sample code provided that has a try/catch within a do/while in order to recover from these kinds of failures. My only real change is adding a maximum number of attempts to retry.

Note that, according to my log, there is never more than one retry needed/attempted -- the transfer behaves as if it was successful and it's not until further downstream in our workflow that it's determined that the file transfer resulted in a corrupted file. My workaround for this will be to do a file size check after a successful copy and confirm there's a match, and restart the entire copy if not. This is obviously not ideal because some of these files will be 300-500GB and will waste several hours.


    // Classes referenced below (declared via `use` at the top of the file):
    //   Aws\S3\S3Client, Aws\Credentials\Credentials,
    //   Aws\S3\MultipartCopy, Aws\Exception\MultipartUploadException
    $this->s3 = new S3Client([
        'version' => 'latest',
        'region'  => $_ENV['AWS_REGION'],     // this is us-east-1 for this instance
        'retries'  => 3,
        'credentials' => new Credentials(<parameters removed>)
    ]);

    // NOTE -- this is not used by the copier, but I'm including it in case the context matters
    $this->s3->registerStreamWrapper();

    $partNumber = 0;
    $percentComplete = 0;
    $partSize = 536870912; // 512 MiB
    if (isset($_ENV['AWS_S3_MULTIPART_PARTSIZE']) === true && intval($_ENV['AWS_S3_MULTIPART_PARTSIZE']) > 536870912) {
        $partSize = intval($_ENV['AWS_S3_MULTIPART_PARTSIZE']);
    }

    $copier = new MultipartCopy(
        $this->s3,
        $sourceBucket.'/'.$sourceKey,
        [
            'bucket' => $destinationBucket,
            'key' => $destinationKey,
            'part_size' => $partSize,
            'before_upload' => function () use ($bytesToCopy, $partSize, &$partNumber, &$asset) {
                $percentComplete = intval(100.0 * round((float) ($partNumber * $partSize) / (float) $bytesToCopy, 2));

                $this->logger->log([
                    'event' => 'debug_moveS3Object',
                    'level' => 'debug',
                    'assetId' => $asset->getId(),
                    'bytesToCopy' => $bytesToCopy,
                    'percentComplete' => $percentComplete
                ]);

                $asset->setCopyProgress(intval($percentComplete));
                $this->entityManager->flush();

                $partNumber++;
            }
        ]
    );

    $attempts = 0;
    $errorMsg = null;
    $result = null;
    $maxAttempts = 3;
    if (isset($_ENV['AWS_S3_MULTIPART_MAXATTEMPTS']) === true && intval($_ENV['AWS_S3_MULTIPART_MAXATTEMPTS']) > 0) {
        $maxAttempts = intval($_ENV['AWS_S3_MULTIPART_MAXATTEMPTS']);
    }

    // Retry the whole copy until it returns a result or maxAttempts is exhausted.
    do {
        try {
            $result = $copier->copy();
        } catch (MultipartUploadException $e) {
            // Rebuild the copier from the exception state so that, per the docs,
            // only the parts that failed should be retried on the next pass.
            $copier = new MultipartCopy(
                $this->s3,
                $sourceBucket.'/'.$sourceKey,
                [
                    'state' => $e->getState(),
                ]
            );

            $errorMsg = $e->getMessage();

            $this->logger->log([
                'event' => 'multipartUploadFailed',
                'level' => 'warning',
                'attempts' => $attempts,
                'error' => $errorMsg
            ]);

            sleep(10 + ($attempts * 10));
        }
    } while (isset($result) !== true && $attempts <= $maxAttempts);


Version of AWS SDK for PHP: v3.140.2

Version of PHP (php -v): 7.3.19-1~deb10u1

To Reproduce (observed behavior): See the code snippet above. The issue happens intermittently and requires S3 to fail on at least one part of the multipart copy. However, even this doesn't guarantee a corrupted file, as several "failed part" instances have recovered successfully.

Expected behavior: The destination file should be identical to the source file.


Additional context: I'm using this library within a Symfony Command worker. The worker waits for an event informing it that a file is ready for copying and, once triggered, performs the copy (along with some other unrelated tasks like getting an MD5 of the source prior to the copy).

I'm using a part size of 1 GB.

It's worth noting that we run 3 of these workers at the same time on the same instance to allow up to 3 transfers to occur simultaneously. I wouldn't think there would be a multi-threading issue with this function, but it seems like a possibility given how intermittent the problem is. Also, in the latest instance of this bug showing up, there were indeed multiple copies happening at the same time. I don't have the data readily available as to whether this was the case in previous occurrences, though.

somnaderi commented 4 years ago

There was a potential source of the problem on my side: a process cycle that I run on these workers each morning at 2am. Processes aren't cycled if they are mid-processing/copying.

One of the 3 failure instances I'm seeing spanned this 2am process cycle event. That process waited until it completed 2 hours later before being cycled. The two other processes that were idling at 2am were cycled as expected.

Of the 2 other failures, one started at 2:21am, and the other was a copy that started at 7am (5 hours after the processes were restarted).

I don't think these restarts would contribute to the failures, but I'm including them here for completeness.

somnaderi commented 4 years ago

Would anyone have some time to look at this, by chance? Or if there's missing information, I could try and dig it up to help...?

somnaderi commented 4 years ago

Additional data point: I reduced the number of worker processes to 1 apiece -- one for ingesting large files, and one for delivering large files. So far (almost 2 weeks) there have been no additional instances of file corruption.

The ingest and delivery workers operate under different instances of Symfony even though they are on the same server. So I would think that means there's at least some isolation there. That said, in use, ingests usually happen in batches at a separate time of day than deliveries -- so if there IS a cross talk issue, it may not show up here for that reason.

And this is certainly not definitive. But it does at least seem to suggest the problem is related to multiple workers running in parallel. That said, it doesn't really narrow the problem down to the AWS client or the Symfony commands. Another data point here -- I'm using PM2 to run the workers in parallel. So I guess there's 3 variables there -- AWS client, Symfony, and PM2.

Is there anywhere in the S3 client that could be susceptible to problems if running multiple MultipartCopy processes in parallel on the same server?

somnaderi commented 4 years ago

More information that contradicts the earlier finding. We've been running our ingest workers as single processes for a few weeks now. For the first few weeks, there were no incidents of this issue. In the last 2 weeks, there have been 5.

On the plus side, this doesn't appear to be a multi-threading/multi-process issue. On the negative side, well, there still appears to be some kind of issue with the library being able to recover from part copy failures.

Same issue as before -- the copy completes with a destination file that's almost twice the size of the source file.

Any help here would be appreciated.

somnaderi commented 4 years ago

Still haven't heard anything about this. I'd like to provide more data, if possible, to help. Or even debug myself, if I can.

Aside from this -- is there any documentation available as to how I would write my own multi-part copy function to perform the low-level part copies + retry logic + complete transfer behavior? I've only really seen examples of the high level MultipartUpload/MultipartCopy usage. But if there isn't really an appetite to look into issues that may or may not exist here, I could go ahead and write something more low level with a little direction.

SamRemis commented 4 years ago

@somnaderi thank you for bringing this to our attention again, I apologize for the long delay. I will start looking into this right now. It's going to take some time to answer all of the questions, but I will get back to you as soon as I've had a chance to really look into it

somnaderi commented 4 years ago

@SamRemis Much appreciated, Sam.

I'm generally inclined to believe these kinds of problems are implementation (my code) issues and not the library. The fact that no one else has chimed in with the same problem also makes me lean that way. So if there are any questions about the implementation, I'm more than happy to answer. The only difficult information to gather will be any debug data that needs to be captured when the failure happens -- I can't force S3 to return a failed API response, and I'm only seeing this happen in a production environment where large files are copied frequently. Hopefully you're able to rig up some kind of mock/forced failure response.

That said, I guess it's possible not that many people are using MultipartCopy w/ the PHP library for large files.

Anyways, thanks again, and let me know what I can do to help.

SamRemis commented 4 years ago

You may already know most of what I'm about to say, but just wanted to give you some response:

When you call the copy command, it creates a list of individual requests for you: an 'initiateMultipartUpload' request, one 'uploadPart' request for each of your parts, and one request to 'completeMultipartUpload'. Each of those individual requests will be retried on failure, up to the number of retries you set (you set it to the default 3 above).
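For reference, those steps map to the S3 operations CreateMultipartUpload, UploadPartCopy (for a copy), and CompleteMultipartUpload. A rough, self-contained sketch of the same flow done by hand -- bucket and key names are placeholders, and retry/error handling is omitted:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

// 1) Start a multipart upload on the destination object.
$upload   = $s3->createMultipartUpload(['Bucket' => 'dest-bucket', 'Key' => 'dest-key']);
$uploadId = $upload['UploadId'];

// 2) Copy each byte range of the source with UploadPartCopy.
$sourceSize = $s3->headObject(['Bucket' => 'src-bucket', 'Key' => 'src-key'])['ContentLength'];
$partSize   = 1073741824; // 1 GiB
$parts      = [];
for ($offset = 0, $partNumber = 1; $offset < $sourceSize; $offset += $partSize, $partNumber++) {
    $end    = min($offset + $partSize, $sourceSize) - 1;
    $result = $s3->uploadPartCopy([
        'Bucket'          => 'dest-bucket',
        'Key'             => 'dest-key',
        'UploadId'        => $uploadId,
        'PartNumber'      => $partNumber,
        'CopySource'      => 'src-bucket/src-key',
        'CopySourceRange' => "bytes={$offset}-{$end}",
    ]);
    $parts[] = ['PartNumber' => $partNumber, 'ETag' => $result['CopyPartResult']['ETag']];
}

// 3) Stitch the parts together; only the parts listed here end up in the final object.
$s3->completeMultipartUpload([
    'Bucket'          => 'dest-bucket',
    'Key'             => 'dest-key',
    'UploadId'        => $uploadId,
    'MultipartUpload' => ['Parts' => $parts],
]);
```

Retrying then just means re-running uploadPartCopy for the failed part numbers under the same $uploadId before calling completeMultipartUpload.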

I'm guessing what is happening is that you probably have a very large number of files and occasionally a part fails more than three times, or for some reason the failure goes undetected and a successful response is generated despite a part having failed. Do you think this may be the issue?

If so, when that happens, you don't have to start over like you're doing above -- you only have to re-upload the parts that failed.

Another option would be to set the bucket up to automatically remove the failed uploads.
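A sketch of that option -- a bucket lifecycle rule that aborts incomplete multipart uploads. The rule ID and the 7-day window are placeholders, and $s3 is an Aws\S3\S3Client as in the snippet above:

```php
// Hypothetical rule: abort any multipart upload still incomplete 7 days after it started.
$s3->putBucketLifecycleConfiguration([
    'Bucket' => 'dest-bucket',
    'LifecycleConfiguration' => [
        'Rules' => [
            [
                'ID'     => 'abort-stale-multipart-uploads',
                'Status' => 'Enabled',
                'Filter' => ['Prefix' => ''], // whole bucket
                'AbortIncompleteMultipartUpload' => ['DaysAfterInitiation' => 7],
            ],
        ],
    ],
]);
```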

If you want to go beyond increasing the limit or having the failed uploads removed and do something different with the failed parts, I'd recommend using a custom handler or middleware to retry.
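Short of writing custom middleware, the client-level retry configuration is another knob worth noting; a sketch, assuming an SDK release that supports the array form of 'retries' (standard/adaptive modes):

```php
use Aws\S3\S3Client;

// Raise per-request retries so transient UploadPartCopy failures (500s, dropped
// connections) are retried more times before MultipartCopy marks the part as failed.
$s3 = new S3Client([
    'version' => 'latest',
    'region'  => $_ENV['AWS_REGION'],
    'retries' => [
        'mode'         => 'standard', // or 'adaptive'
        'max_attempts' => 10,
    ],
]);
```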

somnaderi commented 4 years ago

Hi Sam,

Thanks for the update! Let me confirm that I'm understanding correctly here:

The retries you're referring to would be the overall S3Client "retries" parameter, correct? Or are you referring to the retry loop around the MultipartCopy call?

Assuming you mean the first one, then I would expect the behavior of the above code to look something like this:

- S3Client initialized (3 retries)
- MultipartCopy initialized
- copier->copy() -- Loop 1
  - UploadPart call(s) fail 3 times
  - MultipartCopy failure is logged, but retried as we have not reached maxAttempts
- copier->copy() -- Loop 2
  - UploadPart call(s) fail 3 times
  - MultipartCopy failure is logged, but retried as we have not reached maxAttempts
- copier->copy() -- Loop 3
  - UploadPart call(s) fail 3 times
  - MultipartCopy failure is logged and we exit the loop, having reached maxAttempts

(Quick note: the code in the first post doesn't have an $attempts++ increment, but it was added shortly after, and we've had several errors since. This doesn't affect the analysis, though, as I haven't seen anything fail more than once through the copy() loop.)

If this is the case, one thing to note is that errors from Loop 2 or Loop 3 never show up in the logs. The issues I've investigated have logs that look as one would expect if there was a failure on the first pass that was subsequently resolved in the 2nd pass. So the log shows the error from the first pass (what I've posted in the original post), but then a copy completion entry after the 2nd pass.

To clarify some of the questions from your post, these are all happening with single file copies. The single files can be fairly large, though (50-500GB).

So one question I have: is it correct to assume that the underlying uploadPart retry logic is handled by the library? Or would each of those exceptions bubble up to my catch() logic? Given that I've only ever seen a copy trigger a single exception at this level, I'm assuming any of those underlying uploadPart failures are being caught/retried under the hood and then, collectively, when all uploadPart states have reached their final state, a single MultipartUploadException is thrown (and the state from the exception tracks which parts succeeded and which ones failed).

Regarding the proposed solution -- is this an option if I skip using "MultipartCopy" and, instead, perform the initiate/uploadPart/complete steps that you've identified? I imagine this means I'd need to keep track of the states of the various parts and manage the retries of those parts? (this is basically what the MultipartCopy function does, correct?) I'd prefer not to go this route, if possible. If there's a bug identified in the library and the decision is to postpone fixing it, then of course I'd opt to build that logic myself. Ideally we could work towards a fix for the library, though. :)

Regarding the cleanup of failed multipart copies, is there a UI console for viewing what partial uploads exist in S3?
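For reference, partial uploads can at least be listed programmatically; a sketch against the same client, with a placeholder bucket name:

```php
// List multipart uploads that were started but never completed or aborted.
$pending = $s3->listMultipartUploads(['Bucket' => 'dest-bucket']);
foreach ($pending['Uploads'] ?? [] as $upload) {
    echo "{$upload['Key']} uploadId={$upload['UploadId']}\n";

    // Inspect the parts S3 has actually stored for this in-flight upload.
    $parts = $s3->listParts([
        'Bucket'   => 'dest-bucket',
        'Key'      => $upload['Key'],
        'UploadId' => $upload['UploadId'],
    ]);
    foreach ($parts['Parts'] ?? [] as $part) {
        echo "  part {$part['PartNumber']} size={$part['Size']}\n";
    }
}
```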

I have a feeling that there's an issue with the library that causes this behavior:

- MultipartCopy initialized
- 999/1000 UploadParts succeed (numbers are an example)
- 1/1000 UploadParts fails 3 times
- MultipartCopy throws MultipartUploadException
- New MultipartCopy is initialized within catch() with the state from the MultipartUploadException object
- UploadPart statuses are not properly recognized by the new MultipartCopy instance
- Either: a) all 1000 parts are copied again and combined with the first 999 parts, or b) the 1 failed UploadPart is copied again and succeeds, but somehow the final object contains all 1000 parts plus the first 999 successful parts duplicated

The results I'm seeing are almost always near double the file size of the original. Unfortunately, though, the delta is not divisible by the part size (1 GB), nor is it the final "partial" part size (i.e. the remainder of the original file size % 1 GB). But it's consistently within a few GB of double the original.

I'll try to dig up the smallest file I've seen this error on and pull it down, to see if we can open it up and somehow spot any duplication within the file.

Som

SamRemis commented 4 years ago

Hi Som,

There are a few questions here, so I'll address each one individually:

I hope that's helpful. I can play around with it more and get some more info soon

somnaderi commented 4 years ago

Hi Sam,

Thanks again for engaging on this, I really appreciate it.

Regarding the expected behavior, while I haven't found a root cause yet, I can't imagine this is expected behavior. Ultimately -- the file copy signals that the copy is a "success", but really results in an output file that is almost double the source file's size.

Going off the sample code that was provided here (https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/s3-multipart-upload.html), it appears that the usage should be very straightforward. The code I've provided above is effectively this same exact code with the exception being that instead of a potentially infinite do/while() loop, I incorporated a maxAttempts check. But otherwise, it should behave the same.

Based on your response and the documentation provided above, the expected behavior would be:

  1. When 1-to-many UploadParts fail under the hood, those parts would be aggregated and included in a MultipartUploadException object
  2. The state of the MultipartCopy would be available as part of the MultipartUploadException object
  3. The state can then be used to initialize a new MultipartCopy object
  4. The MultipartCopy (ie ->copy()) can be tried again -- this time with the new state as a starting point
  5. The new MultipartCopy will determine which UploadParts were in a failed state and retry delivering only those
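
One way to sanity-check steps 1-5 above, as a sketch: log what the exception's state object reports before resuming. Variable names mirror the original snippet, and the method names used here on Aws\Multipart\UploadState (getId(), getUploadedParts()) are an assumption to verify against the installed SDK version:

```php
try {
    $result = $copier->copy();
} catch (MultipartUploadException $e) {
    $state = $e->getState();

    // getId() is assumed to carry the Bucket/Key/UploadId of the in-flight upload;
    // getUploadedParts() lists the parts the SDK already considers successful.
    $this->logger->log([
        'event'         => 'multipartUploadFailed',
        'level'         => 'warning',
        'uploadId'      => $state->getId()['UploadId'] ?? null,
        'uploadedParts' => count($state->getUploadedParts()),
        'error'         => $e->getMessage(),
    ]);

    // Resume from the recorded state rather than starting a fresh upload.
    $copier = new MultipartCopy($this->s3, $sourceBucket.'/'.$sourceKey, ['state' => $state]);
}
```

If the resume were behaving as described, the uploadedParts count should be close to the total expected part count on the retry.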

I guess, just so that we're on the same page, there are really 2 failures here, btw.

One is S3's APIs failing (500 errors, empty response errors, etc.). From what I can gather, these are "normal" and to be expected and accounted for -- so I think we're on the same page that this isn't the failure we're talking about.

Two is a failure on the client side to account for those errors. The problem is, from my perspective -- and looking at how simple the MultipartCopy implementation code is -- I can't see where the problem could be in my code rather than in the library. At one point I thought maybe there was a concurrency limitation, and maybe that was an implied limitation of the library that I didn't know about -- but since I've removed concurrency as a variable, the problem still exists. Otherwise, I feel like I'm using the library as intended.

I do have some additional data about the failures. I recently wrapped the code above in a retry loop that checks for the file size of the output to match the input. It's already caught a couple failed copies and was able to retry them and eventually succeed. Here are the errors from my logs (I've removed the identifying details):

2020-09-02T21:13:55.980384-07:00: pid=27991 wyvernUid=1 event=multipartUploadFailed level=warning attempts=0 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:
- Part 4: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=4&uploadId=(removed)\"; AWS HTTP error: cURL error 52: Empty reply from server (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
"
2020-09-02T21:34:50.507508-07:00: pid=27991 wyvernUid=1 event=multipartUploadFailed level=warning attempts=1 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:
- Part 341: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=341&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=341&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 5804: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=5804&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=5804&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 5922: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=5922&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=5922&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 6388: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=6388&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=6388&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>

Something interesting to note here is that the part numbers get up to at least 6388. This file is 144,668,807,572 bytes and the part size was 1GB -- so it shouldn't have needed more than 145 parts in the first place. But I may not be clear on what the library does under the hood or how these numbers are assigned. Hopefully it can provide some insight into where to look, though?
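As a quick sanity check on the expected part count (whether the configured part size is 1 GiB or a decimal GB only shifts the figure slightly -- either way it is nowhere near the 5000-6000 range seen in the log):

```php
$objectSize = 144668807572; // bytes, from the transfer above

echo (int) ceil($objectSize / (1024 ** 3)), "\n"; // 135 parts at a 1 GiB part size
echo (int) ceil($objectSize / (1000 ** 3)), "\n"; // 145 parts at a 1 GB (10^9 byte) part size
```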

Also -- this is the first time I've seen a case where it took until the 3rd attempt to "succeed" doing a MultipartCopy, and it didn't actually end up succeeding. The subsequent logs show that the file size test failed and the transfer was restarted from scratch. Even on the 2nd copy attempt, it failed again (despite reporting a "success"), and it wasn't until the 3rd restart from scratch that the copy succeeded. Below are the errors from the 2nd failed attempt (note, again, that the part numbers get up to almost 9000).

2020-09-02T21:45:22.144050-07:00: pid=27991 wyvernUid=1 event=multipartUploadFailed level=warning attempts=0 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:
- Part 113: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=113&uploadId=(removed)\"; AWS HTTP error: cURL error 52: Empty reply from server (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
"
2020-09-02T22:05:31.595674-07:00: pid=27991 wyvernUid=1 event=multipartUploadFailed level=warning attempts=1 error="An exception occurred while uploading parts to a multipart upload. The following parts had errors:
- Part 4136: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=4136&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=4136&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 5246: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=5246&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=5246&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 7574: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=7574&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=7574&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 7801: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=7801&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=7801&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>
- Part 8797: Error executing \"UploadPartCopy\" on \"https://(removed).s3.amazonaws.com/(removed)?partNumber=8797&uploadId=(removed)\"; AWS HTTP error: Server error: `PUT https://(removed).s3.amazonaws.com/(removed)?partNumber=8797&uploadId=(removed)` resulted in a `500 Internal Server Error` response:
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed) (truncated...)
 InternalError (server): We encountered an internal error. Please try again. - <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>(removed)</RequestId><HostId>(removed)</HostId></Error>

I should add one more thing. In my original bug I mentioned that I'd seen the retry logic succeed before. While I'm fairly sure that's the case, I don't remember when I last confirmed that. I think it's worth investigating this as if it is a deterministic bug (under specific failure circumstances) rather than something that randomly happens.

And, once again, I will say if there are other variables here about how the library should be used that may be affecting these results, I'm not ruling out that my implementation could be the problem. But given how simple the Multipart usage appears to be, I'm more and more hard pressed to think this is an implementation issue.

Thanks again for your time with this, Sam!

Som

SamRemis commented 4 years ago

Oh! I misread what you said. There is definitely more to look into here, I apologize for my confusion. It will take a while to mock the error you're seeing, but with that extra info you provided, I should be able to do it.

As for your issue: I'm thinking the increased file size may be due to your try/catch loop, which starts a new multipart upload object each time the copy fails.

In order to give customers more flexibility, we allow part numbers to be non-consecutive. For example, let's say you wanted to copy a file containing the first bit of the alphabet: "abcdefghi".

You could assign part numbers in any ascending order and it would come out the other end intact. It wouldn't matter if you did:

[part 1] - "abc"
[part 2] - "def"
[part 3] - "ghi"

Or something like

[part 1] - "abcdef"
[part 6] - "ghi".

When the complete multipart upload function is called, it will come out correctly either way. My initial guess here is that when you're doing the try/catch, you're effectively doing two multipart uploads in one. It starts numbering the upload parts automatically, as follows:

[part 1] - "abc"
[part 2] - "def" 
[part 3] - "ghi" //failure and retry happens here; completeMultipartUpload is never called; this part will be missing
[part 4] - "abc" //starting a new multipart upload without having closed the last causes it to continue the last
[part 5] - "def"
[part 6] - "ghi"

Because of the retry not completing the multipart upload and a new one being started on the same file, when the parts are assembled, we end up with a file containing the contents "abcdefabcdefghi" (in order, parts 1, 2, 4, 5 and 6).

If this is indeed the cause, I would want to change the behavior to let a customer know about the failure and allow them to retry on their own. Thank you so much for bringing this to our attention.

somnaderi commented 4 years ago

Ah, some good information here that I didn't know about.

The part number assignment makes sense. Given that I'm not doing the assigning myself, though, what is the behavior of the library when using MultipartCopy? I would have assumed it would be 1 part number per part size -- in my case, 1 part number per 1 GB -- is this not the case?

I ask because the first set of errors I posted in my last comment were from 2 back-to-back exceptions. This is why the part numbers being in the 5000/6000 range don't make sense at first glance -- I'm not assigning the part numbers, and I would have thought the library wouldn't get past 1 part number per 1 GB (the part size in my implementation). In this case, if I'm expecting 145 parts (the file is 144 GB and change), even a 2nd run through wouldn't see part numbers above 290. Also, if the part number were some kind of singleton incremented over multiple instantiations of MultipartCopy, the second exception in that first group of errors was part number 341 and the last was 6388, and there were no concurrent transfers happening -- so it still wouldn't seem to make sense unless there's more complicated logic here.

What you said about multiple MultipartCopy instantiations accumulating parts would certainly explain the behavior I'm seeing. I should point out that while I'm following the MultipartCopy example from the page I linked to in my last comment, the retry logic was added based on the earlier example on that page for MultipartUpload error recovery. From what I could tell, the two functions are intended to behave very similarly -- so if the error recovery implementation for MultipartCopy is different, that should be documented (though, ideally, the two would have the same behavior).

If that were the problem, what would a solution look like? There doesn't appear to be an "abortMultipartCopy" analog for "abortMultipartUpload". Also, are we sure that aborting/unsetting/deleting the first MultipartCopy object won't go through and delete the underlying UploadParts -- thereby breaking the 2nd MultipartCopy object's ability to retry only the failed parts? I think this is what would need to be tested once the error cases are mocked. Unfortunately it's not something I'd feel comfortable doing on my end as we only see these issues intermittently in our production environment.
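For what it's worth, there is no copy-specific abort: UploadPartCopy feeds an ordinary multipart upload, so the low-level abortMultipartUpload call should apply to an abandoned copy as well. A sketch, assuming the UploadId can be read off the exception state as above:

```php
// Abandon the in-flight upload so its stale parts can never be stitched into a later attempt.
$id = $e->getState()->getId(); // assumed to contain Bucket, Key and UploadId

$this->s3->abortMultipartUpload([
    'Bucket'   => $id['Bucket'],
    'Key'      => $id['Key'],
    'UploadId' => $id['UploadId'],
]);
```

Note that aborting discards the parts already copied under that UploadId, so this trades resuming only the failed parts for a guaranteed clean restart.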

Anyways, thanks again for your help, Sam!

somnaderi commented 4 years ago

Hi @SamRemis , just curious if there's any update on this?

Appreciate the help!

SamRemis commented 4 years ago

@somnaderi I still haven't been able to replicate the error, and I've spent a good bit of time on it. Whether I mock errors with retries or without them, the copy still comes out properly.
I have one last theory to test -- whether two copies being created before one completes causes it. I will follow up when I'm done testing that.

One odd thing I saw -- multipart copies number their parts starting with 1 and increasing by one each time (for instance [1, 2, 3, 4, 5, ..., n]). Part 8987 would have to come from a very large file to have that many parts with your part size.

SamRemis commented 4 years ago

For me, no matter how I initiate one or multiple uploads, it always overwrites the correct part number when the upload is happening. Are you just initiating the uploads from one computer, or is it a distributed system?

somnaderi commented 4 years ago

That's really odd. Are you, by chance, using a different version than I am? I'm on 3.140.2 (Guzzle v.6.5.2). Not sure if there's been any changes since then that may have inadvertently fixed this...? Would you be able to try the same mocked errors with that version to see if it triggers this problem? I could theoretically update to latest, but then we're just waiting potentially a few weeks and watching logs to see if/when it happens again (plus, last time I made a change to attempt to address this, it was about 3 weeks before another error popped up again).

That's interesting about the part numbers -- sounds like it never should get past ~150 in my case (150GB file). Are the part numbers assigned on the client side or on the server side? The numbers in the logs above are all from a single exception. Maybe there's a specific type of error that is triggering retries under the hood that are incrementing part numbers? It couldn't be the 3 retries, though because even then, worst case we'd have 150 x 3 = 450 part numbers, not 8000 or whatever.

The uploads are initiated and run from a single EC2 instance / PHP command script, yes. There are 2 separate processes that run that perform copies -- one for copying files to an internal location, and a separate one that copies to an external location. The two applications are separate services -- each with their own composer/vendor/AWS library/dependencies. I'm reluctant to believe there's any cross-talk going on between the two unless it's happening on the server side for some reason. But, even if there were cross-talk (client or server side) -- there were no other copy processes in progress when these errors were logged. They were the only S3 copies running from that instance at that time.

Knowing now how the part number sequence is intended to work, there definitely appears to be (or was?) a bug in the library then. What other steps can we take to whittle away to a solution?

Potentially related, but these applications are long-running PHP commands that can run up to 24 hours before being restarted. They sit and wait for event bus notifications for copy requests. On first request, we create an S3Client object and keep it active for the rest of the life of the process (max of 24 hours). I'm fairly sure I've seen errors at all times of day -- as in, when the client is "fresh" and other times when it's been run several times throughout the day -- but it's possible these errors may only arise after a few copy attempts using the same instantiated S3 object. I'd have to dig into the logs to see if there are cases where it happens on the "first try of the day".

So one potential solution here could be to close/reopen the S3 object for each copy. It's also potentially something to add to your test case -- using the same instantiated S3Client object for multiple copies.

howardlopez commented 4 years ago

@somnaderi It's possible the too-high part numbers may be a key to understanding what's going on here.

Just a brief sanity check - are the part numbers from your logs ones that you are generating and incrementing yourself in the before_upload callback, or are they actually being returned by the service?

One thing that could be useful for the investigation is to turn on debugging (client constructor option 'debug' => true) and add the output to your log. The next time this issue occurs, we'll have a much better record of what's actually happening with individual requests and responses. The big cost to this is that this will vastly increase the size of your logs, which may or may not be feasible with your setup, but would be a great help in debugging.
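A sketch of that option; the array form routes the (very verbose) output to its own file and scrubs credentials -- the log path is just an example:

```php
$this->s3 = new S3Client([
    'version'     => 'latest',
    'region'      => $_ENV['AWS_REGION'],
    'retries'     => 3,
    'credentials' => new Credentials(/* parameters removed */),
    // 'debug' => true dumps wire-level data to STDOUT; the array form redirects it.
    'debug'       => [
        'logfn'      => function ($msg) {
            file_put_contents('/var/log/app/s3-debug.log', $msg, FILE_APPEND);
        },
        'scrub_auth' => true, // keep credentials out of the debug output
    ],
]);
```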

somnaderi commented 4 years ago

Hi Howard,

The "error=" key in the log shows the output from the MultipartUploadException's message ($e->getMessage()). The log pasted a few posts above come from the code included in the original post for "event" => "multipartUploadFailed". Those part numbers were all in the error message.

As far as I can tell, I'm not doing anything to dictate part numbers with before_upload (the function I'm using is also in the original post). I have a variable, $partNumber, defined outside the callback's scope, that I use to keep track of how many parts have been copied for progress-tracking purposes (which I know won't always be accurate due to possible retry logic, but I figured it would be close enough and, based on the successful deliveries I've observed, it seems to work well enough).

It appears that variable name ("partNumber") is also used in the MultipartCopy code. I have to imagine the two are in separate scopes and my increment wouldn't affect it...? Not positive without knowing the rest of the AWS code, though.

Something else I hadn't noticed before. Looking at this page (https://docs.aws.amazon.com/aws-sdk-php/v3/api/class-Aws.S3.MultipartUploader.html), it appears the callback signature should have a single (Aws\Command $command) argument -- which mine doesn't. I wouldn't think this would affect the copy behavior (seeing as I'm not using anything from $command), but it's worth noting (and fixing, on our end). I'm not sure how sensitive PHP is to mismatched arguments, but I would think it's resilient enough or it would throw errors. Let me know if there's any reason you can think of why this might affect the part numbering scheme.
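For reference, switching to the documented signature would also make it possible to log the part number the SDK itself is about to send, which would answer the numbering question directly. A sketch -- reading the command's PartNumber parameter via array access is an assumption to verify:

```php
'before_upload' => function (\Aws\CommandInterface $command) use ($bytesToCopy, $partSize, &$partNumber, &$asset) {
    // Aws\Command is array-accessible, so this should expose the parameter
    // the SDK set for this UploadPartCopy request.
    $sdkPartNumber = $command['PartNumber'] ?? null;

    $this->logger->log([
        'event'            => 'debug_moveS3Object',
        'level'            => 'debug',
        'assetId'          => $asset->getId(),
        'sdkPartNumber'    => $sdkPartNumber,
        'localPartCounter' => $partNumber,
    ]);

    $partNumber++;
},
```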

I'll turn on the debugging and see what we can catch. It's possible even a clean run will show some anomalies that'll provide insight. Depending on the size of the logs, we may deploy to production temporarily as well.

Again, appreciate the help!

Som

somnaderi commented 4 years ago

One other thing that dawned on me -- are there any issues that might arise from misconfigured permissions that the library may not bubble up? Transfers succeed most of the time, so the happy path works. I'm wondering if there's some kind of permission that would be needed for resolving retries that maybe isn't as obvious...?

The policy I'm currently using enables the following S3 commands:

- s3:GetBucketLocation
- s3:ListBucketMultipartUploads
- s3:ListBucket
- s3:GetBucketLocation
- s3:PutObject
- s3:GetObjectAcl
- s3:GetObject
- s3:AbortMultipartUpload
- s3:DeleteObject
- s3:PutObjectAcl
- s3:ListMultipartUploadParts

somnaderi commented 4 years ago

@SamRemis @howardlopez We deployed the debug flag a few days ago. Still haven't had a failure happen in production, but I've at least got debug data for a successful delivery.

As far as I can tell, the part numbers in this transfer count up as expected. So we may need to wait until a failure to get anything interesting.

I've tried to sanitize it as best as possible, but I still don't really want to post it here in case there's identifying data in the debug (it's also a reasonably large file). Is it worth you guys checking it out to see if it looks as expected? If so, is there an email for one of you guys where I could send the debug data?

Thanks,

Som

howardlopez commented 4 years ago

@somnaderi Thanks for setting that up! I think it'll be better to wait for when the issue crops up again, as looking at the logs from properly functioning calls likely won't tell us much. When the issue does occur again, you could potentially also send us a copy of a successful upload log just so we can do an exact comparison of the issue vs. no-issue chain of events on your application.

somnaderi commented 4 years ago

@howardlopez We just got a failure with the debugger turned on. I'm going to pull down the log and sanitize it now -- is there somewhere I can send this to you directly (as I would prefer not to post our customer's potentially identifying information publicly here). Thanks!

somnaderi commented 4 years ago

@howardlopez @SamRemis

Is there some way I can deliver the log files to one of you directly without posting them here publicly? I'm trying to sift through the logs myself and it looks like it would take me a whole lot of coming up to speed on what's going on under the hood to figure out where the problem is. (the log with the errors ended up around 135MB)

Thanks!

Som

howardlopez commented 4 years ago

@somnaderi We can create an S3 bucket that you have write access to for delivering the log files, if that works. I'd still recommend sanitizing what you can of the data, particularly anything sensitive or customer-related. Can you leave an email address and we'll contact you with details?

somnaderi commented 4 years ago

@howardlopez Sure! Email is my github name at Google's most popular email domain. :)

howardlopez commented 4 years ago

@somnaderi For both the successful and failure log files that you sent, what's the size of the uploaded asset in question?

somnaderi commented 4 years ago

@howardlopez The successful copy has 5 files, the 2 big ones are ~73GB and ~3GB.

There are 2 failed transfers in the failure log, but both for the same delivery (I originally thought this was 2 separate transfers, but it's actually the same with 2 failures before it "succeeds"). That delivery has 2 files -- ~68GB and ~2GB.

The strings to look for to find the start of the delivery in the logs are:

- event=deliveryPrepare -- contains the manifest being delivered
- event=startingCopy -- just before copying starts
- event=multipartUploadFailed -- the 2 failures logged
- event=multipartUploadComplete -- when the copies were reported as "successful"

Let me know if there's anything else you need. Thank you!!

howardlopez commented 4 years ago

@somnaderi Thanks for that data.

So far looking through the logs has not yielded too much insight. I haven't seen any evidence that it's the handling of the UploadPart 500 errors that's triggering the strange part behavior, though it's possible it's just not evident in the logs. As an idle thought regarding the incorrect number of parts, the MultipartCopy uses the source size to determine that number, which is derived from the metadata ContentLength value. Is there any way that number could be thrown off by something?

One avenue that might be worth investigating is to replace the MultipartCopy with fetching and uploading the object yourself with MultipartUploader. This may be unfeasible for your project, but thought I'd throw that out there. This could isolate if the issue is indeed localized in the MultipartCopy-specific logic, or if there may be some other issue at play.
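A sketch of that alternative, reusing the s3:// stream wrapper already registered in the original snippet; the exact wiring (especially resuming with the same stream) is an assumption to adapt:

```php
use Aws\S3\MultipartUploader;
use Aws\Exception\MultipartUploadException;

// Read the source through the registered s3:// wrapper and re-upload it with
// MultipartUploader instead of copying server-side with MultipartCopy.
$source = fopen("s3://{$sourceBucket}/{$sourceKey}", 'r');

$uploader = new MultipartUploader($this->s3, $source, [
    'bucket'    => $destinationBucket,
    'key'       => $destinationKey,
    'part_size' => $partSize,
]);

try {
    $result = $uploader->upload();
} catch (MultipartUploadException $e) {
    // Same resume pattern as MultipartCopy: rebuild from the exception state.
    $uploader = new MultipartUploader($this->s3, $source, ['state' => $e->getState()]);
    $result   = $uploader->upload();
}
```

The trade-off is that every byte then flows through the worker instead of being copied inside S3, but it would isolate whether the problem is specific to the MultipartCopy path.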

somnaderi commented 4 years ago

@howardlopez Thanks for continuing to look into this!

Assuming the question is whether the ContentLength metadata of an S3 object could be changed by some external process, I'm fairly sure that isn't happening here. We take a headObject() ContentLength just before the copy to drive our "overall" retry logic (not the exception-handling retry): when the copy "succeeds", we compare the source and destination ContentLength to determine whether the copy was really successful and, if not, we do the entire MultipartCopy over again.
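That check, roughly as described -- a sketch where the event name is illustrative and the variable names mirror the first post:

```php
$sourceSize = $this->s3->headObject([
    'Bucket' => $sourceBucket,
    'Key'    => $sourceKey,
])['ContentLength'];

$destSize = $this->s3->headObject([
    'Bucket' => $destinationBucket,
    'Key'    => $destinationKey,
])['ContentLength'];

if ($sourceSize !== $destSize) {
    // The "successful" copy is actually bad; restart the whole MultipartCopy.
    $this->logger->log([
        'event'      => 'copySizeMismatch', // illustrative event name
        'level'      => 'warning',
        'sourceSize' => $sourceSize,
        'destSize'   => $destSize,
    ]);
}
```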

I wouldn't say the pull down / MultipartUploader logic would be unfeasible, but it may be a little bit before we have a chance to implement it.

I'm not sure this is in this set of logs, but it's worth noting that it's not only the 500 errors that trigger this issue. I've seen it with the "cURL error 52: Empty reply from server" error as well.

One thing I was seeing in the logs that seemed odd was how many connections the transfer appeared to go through in the failed case. In the successful case, it appeared to only need 2-3 connections, but in the failed case the connection appears to close regularly (the count of "Connection: close" occurrences balloons up to 100+). Does that seem like normal behavior?

Based on what you've seen in the logs, and seeing the code from the first post that generated those logs, does it seem likely that this is a library bug vs. an implementation bug?

Is there anything else you can think of that I could add -- maybe some kind of state check in the part callback? -- that would be helpful for debugging further?

Thanks again for all your time here! Really hope we can nail down what this is!

Som

SamRemis commented 3 years ago

Hi @somnaderi, Just wanted to give you an update -- one of the other SDKs had a similar problem a year or two ago and sent me the internal documentation on it. It's not a perfect solution, but it will at least properly abort a failed copy so it is easier to catch and retry. Thank you so much for your patience on this and for working with us -- I'll get started on a patch as soon as possible; hopefully that will fix it.
-Sam

somnaderi commented 3 years ago

Hi @SamRemis Thank you (and @howardlopez ) for continuing to look into it! :)

Yeah, I guess there are 2 goals for a fix -- 1) that the process shouldn't claim to be successful when it has really produced a bad copy, and 2) that retrying a failed copy results in an efficient determination and retry of just the failed parts.

It sounds like the patch you're describing would accomplish at least 1, would it also address 2? If so -- wonderful! :)

Out of curiosity, earlier you'd mentioned setting up a mock test and not being able to reproduce the issue. Does the root cause of the other SDK issue shed light enough to understand why your mock set up succeeded where mine continued to fail? The only things I could think of were maybe different SDK/Guzzle versions, but I also wondered if you used the same before_upload callback that I was using and maybe that played into it somehow.

For example, I use a variable called "partNumber" in my callback (you can see the code in the first post), which I noticed is also used internally in the SDK. It should be scope-protected, though I'm not sure exactly how the callback is invoked. It seems unlikely to matter (due to scope), but it's the only other difference I could think of between your mock and my implementation.

Anyways, thanks again!!

Som

SamRemis commented 3 years ago

Hi @somnaderi,

I apologize for the delay in responding; we just merged a PR (#2139) that should at least have it throw a proper error when the files are corrupted. If this works, we can look at adding the automatic retries.

-Sam

somnaderi commented 3 years ago

Hi @SamRemis,

This is great, thank you! :) We've already got the retry logic in place, too, so hopefully just updating this will mean we won't get the corrupted files at the destination.

In looking at this, was any insight gleaned as to what was causing the errant partNumbers in the first place? Given that no one else appears to have chimed in about this being an issue, I'm figuring a) not a lot of people do large file transfers like this, but also b) maybe not many people use the before_upload callback.

It makes me wonder if (b) has something to do with why you were unable to reproduce in your mock setup. shrug

Thanks again for the PR!

Som

SamRemis commented 3 years ago

@somnaderi I'm not sure. Since this issue was also seen in the C# repository, I'm thinking it probably has to do with a race condition or something service-side. It would seem odd for it to show up in multiple SDKs if it were down to our custom logic only.

somnaderi commented 3 years ago

@SamRemis That would make sense, though I'm curious -- I guess I thought the AWS CLI would also perform its function via the same APIs. Is this not the case? The CLI seems to be fairly robust, as we have yet to see an issue with large files being corrupted during copies since switching over. It still presents problems for our architecture, so the SDK would still be ideal, but still -- it does work.

While I would normally agree that it would be odd for different SDKs to produce similar results, I could also see the possibility of a) teams collaborating/sharing code if some capabilities from one SDK were ported from another, or b) this requiring some nuanced understanding of how the service side works -- because, say, there's something unintuitive about the service itself -- making it a trap that independent engineers could fall into.

Of course, I don't know about how the teams are organized around these services, so that's all speculation. :)

Is it fair to say, given the fix to at least catch failures correctly, that a deeper investigation into why this isn't working for large files probably isn't going to happen any time soon? Just curious how long we may have to keep with the CLI approach. While triggering a proper failure is a big step in the right direction, the chance of failure appears to increase with larger files. We're now transferring some 700-800GB files with the CLI without issue, and the only times I'd seen our retry logic max out on attempts were for some 400-500GB examples, which went through 3 full deliveries before failing out completely. Doing a 700-800GB transfer 3+ times due to retry logic starts to have a significant impact on overall workflow times.

Anyways, obviously still really appreciate all the time that has been put into investigating this. Much appreciated!

Som

github-actions[bot] commented 2 years ago

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.