aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0
6.03k stars 1.22k forks

uploadDirectory(,bucket,"") calls lstat("`pwd`/s3://bucket/file...") for every file in the bucket #430

Closed radford closed 9 years ago

radford commented 9 years ago

When the prefix argument to uploadDirectory() is empty, the SDK calls lstat() on the s3:// URL of every file in the bucket. This can be quite slow when all you want to do is upload a small directory. Interestingly, this does not happen for any file in the bucket if the prefix is non-empty.

mtdowling commented 9 years ago

Which version of the SDK are you using?

radford commented 9 years ago

2.6.14

radford commented 9 years ago

2.7.9 as well.

radford commented 9 years ago

It looks like this is an optimization for the case of an empty prefix. But it's only an optimization for large uploads to a relatively empty bucket. When you have a small upload to a large bucket this is slow. For my test case, when the optimization makes sense (small bucket) it's 6x faster and when it doesn't (huge bucket) it's 6x slower.

Either way it seems wrong to be trying to open URLs locally, for which there is zero chance of success, especially when the API has no URLs in it to begin with.

mtdowling commented 9 years ago

I'm not able to reproduce this with the following code:

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = S3Client::factory([
    'region' => 'us-east-1',
    'curl.options' => [CURLOPT_VERBOSE => true]
]);

$s3->uploadDirectory(__DIR__ . '/src', 'abcba', '', ['debug' => true]);

Note: I also tried passing null to the $keyPrefix argument of uploadDirectory().

This calls a single GET request for abcba.s3.amazonaws.com, then starts uploading files. The way this function works is that it iterates over the files in the bucket and compares them against the local files to see which files have changed. As the files in the bucket are iterated, you should see more ListObjects related requests for buckets that contain more than 1000 files.
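To make the pagination point concrete, here is a pure-PHP sketch (not the SDK's actual code; listPage() is a hypothetical stand-in for an S3 ListObjects call, which returns at most 1000 keys per request plus a truncation flag):

```php
<?php
// Hypothetical stand-in for ListObjects: returns up to $pageSize keys
// starting at $offset, plus a flag saying whether more pages remain.
function listPage(array $allKeys, int $offset, int $pageSize = 1000): array {
    return [
        'Contents'    => array_slice($allKeys, $offset, $pageSize),
        'IsTruncated' => $offset + $pageSize < count($allKeys),
    ];
}

// Pretend the bucket holds 2500 objects.
$bucketKeys = array_map(fn ($i) => "file-$i", range(1, 2500));

$seen = [];
$offset = 0;
$requests = 0;
do {
    $page = listPage($bucketKeys, $offset);
    $requests++;
    foreach ($page['Contents'] as $key) {
        $seen[] = $key;
    }
    $offset += 1000;
} while ($page['IsTruncated']);

echo $requests, "\n";     // 3 (one request per 1000-key page)
echo count($seen), "\n";  // 2500
```

So a bucket with millions of objects means thousands of ListObjects round trips before the comparison even finishes, which is why the empty-prefix path gets slow on large buckets.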

radford commented 9 years ago

Thanks for taking the time to look into this.

Assuming your abcba bucket has many files in it, run

strace -e lstat php ./upload.php

where upload.php is your script. You should see an lstat(2) call for each of the PHP source files, followed by one for every file in the bucket, of the following form.

lstat("/home/mtdowling/...blah.../s3://abcba.s3.amazonaws.com/...blah...");

You shouldn't need debug nor any curl.options to see the effect.

mtdowling commented 9 years ago

Ah, I see that in the output of sudo dtruss -f -t lstat64 php test.php. I'll see if I can figure out why that's happening.

It's not causing any extraneous HTTP requests though, so I'm not seeing any performance impact from this.

Edit: Just as a datapoint, I checked v3 of the PHP SDK, and it no longer has this problem.

radford commented 9 years ago

I'm not seeing any performance impact from this.

I see a large performance impact. How long does it take to recursively list your destination bucket? If it's not minutes, then your test bucket isn't big enough. :)

mtdowling commented 9 years ago

Just curious: how are you measuring the performance impact? Does strace show you the time it takes to make each call?

radford commented 9 years ago

how are you measuring the performance impact?

It takes 10 minutes to list my bucket and 6 seconds to upload my directory of small files with a prefix. If the prefix is "", however, it takes 10 minutes to upload the same directory: a slowdown of 100x.

mtdowling commented 9 years ago

I found the issue: we were calling getRealPath() on S3 SplFileInfo objects. This isn't needed and was causing the issue. Can you try out this change (on master) and see if it fixes the performance issue?

radford commented 9 years ago

The fix seems to work; lstat(2) is no longer called for every file in the bucket. FWIW, I do still see a call to gettimeofday(2) for every file in the bucket.

Looking at your fix, I'm surprised it wasn't in the StreamWrapper, as calling getRealPath() is still "broken" for anyone who calls it.

I still see 80% CPU utilization for 22s to upload five zero-sized files to a relatively small bucket, and 80% for over 20 minutes (and counting; who knows how long it'll take) when uploading to my large bucket.

mtdowling commented 9 years ago

I guess the lstat calls weren't causing a significant performance impact (but I think I know what is).

Looking at your fix, I'm surprised it wasn't in the StreamWrapper, as calling getRealPath() is still "broken" for anyone who calls it.

I'm not sure there's anything I can do about that. Simply calling getRealPath() on an SplFileObject using a custom stream wrapper will trigger the erroneous call. I believe this is because PHP is calling real_path on the custom protocol filename. real_path doesn't think it's an absolute path, so it prepends the current working directory. That's what I can gather from a quick scan of the source (https://github.com/php/php-src/blob/59593ba66c00220f0161a721c80550b2a0305c31/ext/spl/spl_directory.c#L1287 -> https://github.com/php/php-src/blob/fc33f52d8c25997dd0711de3e07d0dc260a18c11/Zend/zend_virtual_cwd.c#L1429).
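The behavior is reproducible without the SDK at all; any SplFileInfo over a stream-wrapper-style path will do (a minimal repro sketch, assuming only stock PHP). getRealPath() makes PHP resolve the path with realpath semantics: "s3://..." does not look absolute, so the current working directory is prepended internally, which is what produces the lstat("`pwd`/s3://...") calls visible under strace. Since no such local file exists, the method simply returns false:

```php
<?php
// No stream wrapper registration is needed to trigger the behavior;
// the realpath resolution happens before any wrapper is consulted.
$info = new SplFileInfo('s3://bucket/key');

// Internally PHP lstat()s getcwd() . '/s3://bucket/key', finds nothing,
// and getRealPath() reports failure.
var_dump($info->getRealPath()); // bool(false)
```

In other words, the wasted lstat is baked into how getRealPath() treats non-local paths, so the only fix available to the SDK is to stop calling it on S3-backed SplFileInfo objects.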

The system calls to gettimeofday are due to getting the MTime of each file to check if the file needs to be uploaded again... Speaking of which...

If you know your bucket has a ton of files and that you aren't uploading many files, then you could "force" the upload, meaning you upload all of the files regardless of whether they have changed remotely.

$s3->uploadDirectory('/path/to/files', 'mybucket', '', ['force' => true]);

Attempting to determine whether or not a file should be uploaded to the bucket requires that the files in the bucket be iterated over as the local files are iterated. When the file is present locally but not remotely, it is uploaded unconditionally. When they are both present and the local file is newer or has a different size, then the local file is uploaded. This is likely the cause of the slowness and CPU utilization you are experiencing.
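The decision described above can be sketched as a small pure function (an illustrative assumption, not the SDK's actual implementation, which lives in its sync iterators): a local file is uploaded when it is absent remotely, when its size differs, or when it is newer than the remote copy.

```php
<?php
// Illustrative sketch of the upload decision, not the SDK's real code.
// $remote maps object key => ['size' => int, 'mtime' => int] for objects
// already in the bucket.
function shouldUpload(string $key, int $localSize, int $localMtime, array $remote): bool {
    if (!isset($remote[$key])) {
        return true; // not in the bucket yet: always upload
    }
    return $localSize !== $remote[$key]['size'] // size changed
        || $localMtime > $remote[$key]['mtime']; // local copy is newer
}

$remote = ['a.txt' => ['size' => 10, 'mtime' => 1000]];

var_dump(shouldUpload('a.txt', 10, 900, $remote)); // bool(false) - unchanged
var_dump(shouldUpload('a.txt', 12, 900, $remote)); // bool(true)  - size differs
var_dump(shouldUpload('b.txt', 5, 900, $remote));  // bool(true)  - missing remotely
```

Note that the mtime/size heuristic is exactly why the sync can produce false positives: a touched-but-identical file looks changed, and a same-size, same-age rewrite looks unchanged.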

The technique we use for syncing is not foolproof, as it can produce false positives. The AWS CLI is the recommended way to do any type of syncing with S3 (that is, if you do not have to use PHP). We actually removed the syncing aspects from v3 of the PHP SDK in favor of simply uploading every file unconditionally (the abstraction accepts an iterator that yields filenames as strings, so this could be extended in the future to add back in syncing, but we would wait on S3 to provide more API hooks to better allow S3 to be treated like a filesystem).

mtdowling commented 9 years ago

I'll go ahead and close this. Please let me know if you have any further questions.

radford commented 9 years ago

Thanks for fixing the obvious bug of calling lstat(2) on URLs, but uploadDirectory() still behaves differently depending on the prefix. Given that the difference can be so drastic, you might consider documenting the behavior so that others can choose the correct option for their use case. Something like: "if the prefix is "", uploadDirectory() will traverse the entire bucket, which can be very inefficient when uploading to large buckets."

mtdowling commented 9 years ago

Good idea. I'll add this suggestion to our backlog.