aws / aws-sdk-php

Official repository of the AWS SDK for PHP (@awsforphp)
http://aws.amazon.com/sdkforphp
Apache License 2.0

How to send truly asynchronous requests? #621

Closed Briareos closed 9 years ago

Briareos commented 9 years ago

I need to analyze a large number of files on S3, whose keys I read from a database.

What I want is to dynamically add promises to the pool so that at any given moment there are 30 ongoing requests; that is, each fulfilled promise would immediately queue up another request. What I've managed to achieve is to send 30 requests and then batch another 30 afterwards, but that's not quite there yet. I know Guzzle is perfectly capable of doing so, and I found my way around it during the SDK Alpha (which we still use in production), but since Guzzle 6 and so many recent changes we're kinda overwhelmed.

I attempted something like this:


$s3Client = '...';
$i        = 0;

$yieldAsyncCommand = function () use ($s3Client, &$i) {
    return $s3Client->executeAsync($this->s3->getCommand('GetObject', [
        'Key'    => sprintf('test%s.txt', $i),
        'Bucket' => $this->bucket,
        '@http'  => [
            'sink' => sprintf('test%s.txt', $i++),
        ],
    ]));
};

$promises = new \ArrayIterator();
$promises->append($yieldAsyncCommand());

$options = [
    'fulfilled' => function (ResultInterface $result) use ($yieldAsyncCommand, $promises) {
        // Queue up next request
        $promises->append($yieldAsyncCommand());
    },
    'rejected'  => function ($reason) use ($yieldAsyncCommand, $promises) {
        // Log this and queue up next request
        $promises->append($yieldAsyncCommand());
    },
];

$each = new EachPromise($promises, $options);
$each->promise()->wait();

and it does queue up requests, but they get sent only after the whole previous batch finishes. I'm kinda stuck here, so any help would be much appreciated.

adamlc commented 9 years ago

I'm trying to do something similar too. I'm trying to get everything to run within a React event loop

Briareos commented 9 years ago

@adamlc That was my first attempt too, but I was disappointed when I noticed that all async command statuses are pending and you need to call the blocking function Promise::wait() for them to start processing.

adamlc commented 9 years ago

@Briareos You're right, I've just come across that at the moment, very disappointing indeed :(

I've also tried adding them to the guzzle promise queue and calling run on a periodic timer, as per Guzzle's docs, but that also seems to be blocking, which kind of defeats the whole purpose!

mtdowling commented 9 years ago

I think there are some really good questions in this thread. I think I see three separate questions:

  1. You're observing a sort of batching behavior and would like help.
  2. You want to know how to send async requests with the SDK that don't require a call to wait.
  3. You want to use the SDK with React.

So...

1. You're observing a sort of batching behavior and would like help

Your example blocks between calls because you are not queuing up multiple requests; you're only queuing another request after a request completes. Think of the CommandPool and EachPromise abstractions as a pipeline to transfer requests (or commands). The pipeline needs an iterator that yields promises, and it ensures that N promises are in flight at any given time.

Let's make some modifications to your example to make it send the requests concurrently...

<?php
require 'vendor/autoload.php';

use Aws\Sdk;
use GuzzleHttp\Promise;

$sdk = new Sdk(['region' => 'us-east-1', 'version' => 'latest']);
$s3Client = $sdk->createS3();
$bucket = 'my-bucket';

$promiseGenerator = function ($total) use ($s3Client, $bucket) {
    for ($i = 0; $i < $total; $i++) {
        yield $s3Client->getObjectAsync([
            'Key'    => sprintf('test%s.txt', $i),
            'Bucket' => $bucket,
        ]);
    }
};

$fulfilled = function($result) {
    echo 'Got result: ' . var_export($result->toArray(), true) . "\n\n";
};

$rejected = function($reason) {
    echo 'Rejected: ' . $reason . "\n\n";
};

// Create the generator that yields 1000 total promises.
$promises = $promiseGenerator(1000);
// Create a promise that sends 50 promises concurrently by reading from
// a queue of promises.
$each = Promise\each_limit($promises, 50, $fulfilled, $rejected);
// Trigger a wait. Note that if you use an event loop then this is not
// necessary.
$each->wait();

Note that there is also a CommandPool and several examples on sending concurrent requests: http://docs.aws.amazon.com/aws-sdk-php/v3/guide/guide/commands.html#commandpool
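
For reference, a minimal CommandPool sketch for the original use case (up to 30 concurrent GetObject calls, with keys read lazily from any iterable such as a database cursor) might look like the following. It assumes aws/aws-sdk-php v3 is installed and autoloaded by the calling script. `getObjectCommands` and `analyzeObjects` are hypothetical helper names, and `$iterKey` is treated here simply as the position of the request.

```php
<?php
// Sketch only: assumes aws/aws-sdk-php v3 is installed and that Composer's
// autoloader has already been required by the calling script.
use Aws\CommandPool;
use Aws\Exception\AwsException;
use Aws\ResultInterface;
use Aws\S3\S3Client;

/**
 * Lazily turn keys into GetObject commands (hypothetical helper).
 * Because this is a generator, a command is only built when the pool
 * asks for the next one.
 */
function getObjectCommands(S3Client $s3, string $bucket, iterable $keys): \Generator
{
    foreach ($keys as $key) {
        yield $s3->getCommand('GetObject', ['Bucket' => $bucket, 'Key' => $key]);
    }
}

/** Run the commands through a pool that keeps up to $concurrency in flight. */
function analyzeObjects(S3Client $s3, string $bucket, iterable $keys, int $concurrency = 30): void
{
    $pool = new CommandPool($s3, getObjectCommands($s3, $bucket, $keys), [
        'concurrency' => $concurrency,
        'fulfilled'   => function (ResultInterface $result, $iterKey) {
            echo "Request #{$iterKey} done, {$result['ContentLength']} bytes\n";
        },
        'rejected'    => function (AwsException $reason, $iterKey) {
            echo "Request #{$iterKey} failed: {$reason->getMessage()}\n";
        },
    ]);

    // Blocks until every command has settled.
    $pool->promise()->wait();
}
```

Calling `analyzeObjects($s3Client, 'my-bucket', $keysFromDatabase)` would then drain the whole key list while the pool refills each slot as soon as a request settles.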

2. You want to know how to send async requests with the SDK that don't require a call to wait.

The SDK is designed to be able to work with any sort of HTTP client. Some clients might use an event loop that you can tick externally, while others might require you to call wait.

The SDK will use cURL or the PHP stream wrapper by default if you do not configure a custom HTTP handler for the SDK. This is accomplished by using Guzzle by default. However, keep in mind that you can use any HTTP client with the SDK (more on that later).

If you are using the PHP stream wrapper, then there's no way to send the requests other than to call wait. This is because the PHP stream does not allow concurrent requests.

When using cURL, you would need to tick the cURL loop in order to asynchronously progress the transfers. Most non-blocking event loops require that they are ticked to progress the transfers. Using a cURL handler with the SDK is no different. Here's how you could use a Guzzle handler that is coupled to cURL to manually tick the cURL event loop:

<?php
require 'vendor/autoload.php';

use Aws\Sdk;
use GuzzleHttp\Promise;
use GuzzleHttp\Handler\CurlMultiHandler;
use GuzzleHttp\HandlerStack;

$curl = new CurlMultiHandler();
$handler = HandlerStack::create($curl);
$sdk = new Sdk([
    'http_handler' => $handler,
    'region' => 'us-west-2',
    'version' => 'latest',
]);

$client = $sdk->createS3();
$p1 = $client->listBucketsAsync()->then(function () { echo '-done 1-'; });
$p2 = $client->listBucketsAsync()->then(function () { echo '-done 2-'; });
$aggregate = Promise\all([$p1, $p2]);

// Tick the curl loop manually.
while (!Promise\is_settled($aggregate)) {
    $curl->tick();
}

If you write HTTP handlers for other clients that use an event loop that is ticked automatically (because you are calling run or something on an event loop), then manually ticking an event loop or calling wait is unnecessary.

Here are a couple examples of creating custom HTTP handlers for the SDK: https://github.com/aws/aws-sdk-php/tree/master/src/Handler. You could create a custom handler to bind the SDK to an event loop of your choice. Here is more information on SDK handlers: http://docs.aws.amazon.com/aws-sdk-php/v3/guide/guide/handlers-and-middleware.html#creating-custom-handlers

3. You want to use the SDK with React.

> I've also tried adding them to the guzzle promise queue and calling run on a periodic timer, as per Guzzle's docs, but that also seems to be blocking, which kind of defeats the whole purpose!

Yes, adding a Guzzle client that uses cURL into a React event loop would be blocking. This is because you are using two different event loops that do not cooperate with one another. You will need to use a React HTTP handler with Guzzle in order to send non-blocking requests when injecting Guzzle into the React event loop. There is a promising start to a Guzzle handler here: https://github.com/WyriHaximus/react-guzzle-psr7 (@WyriHaximus is doing a fantastic job).

We haven't checked whether this handler is in a state where it works with the SDK, but eventually the goal is that it will. To configure Guzzle to use this adapter and then configure the SDK to use a specific Guzzle client, you would essentially do what I showed in the cURL example above, but with the React handler in place of cURL.
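
As a rough sketch of that wiring, the function below wraps an arbitrary Guzzle handler in a Guzzle client and passes that client to the SDK through `Aws\Handler\GuzzleV6\GuzzleHandler`. It assumes aws/aws-sdk-php v3 with Guzzle 6 or later. `createSdkWithHandler` is a hypothetical helper name, and the event-loop handler itself is deliberately left as a parameter, since how to construct it depends on the library that provides it.

```php
<?php
// Sketch only: assumes aws/aws-sdk-php v3 and Guzzle 6 or later are installed
// and that Composer's autoloader has already been required.
use Aws\Handler\GuzzleV6\GuzzleHandler;
use Aws\Sdk;
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;

/**
 * Build an Sdk whose HTTP traffic goes through the given Guzzle handler
 * (hypothetical helper). $guzzleHandler is whatever callable the event-loop
 * library provides; construct it as that library documents.
 */
function createSdkWithHandler(callable $guzzleHandler, string $region): Sdk
{
    // Put the custom handler at the bottom of Guzzle's default middleware stack.
    $guzzle = new Client(['handler' => HandlerStack::create($guzzleHandler)]);

    return new Sdk([
        'http_handler' => new GuzzleHandler($guzzle),
        'region'       => $region,
        'version'      => 'latest',
    ]);
}
```

Clients created from the returned `Sdk` would then send their requests through the supplied handler, so the event loop that owns the handler is the one that progresses the transfers.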

WyriHaximus commented 9 years ago

@mtdowling thanks for the compliments :+1: . I'd like to point out that making the handler work with the SDK is one of the reasons I started working on it (back in the Guzzle v4 era (it's a long story)). My main focus is to get it following the Guzzle handler specs and requirements and then start making sure it works well with the SDK.

jeremeamia commented 9 years ago

Nice writeup, @mtdowling, and nice work on the React handler, @WyriHaximus.

adamlc commented 9 years ago

Fantastic! Thanks for explaining that @mtdowling :smiley:

Briareos commented 9 years ago

Hey @mtdowling, thanks for the great explanation!

There is one unexpected behavior that I noticed in the first case: the promises are resolved only after a whole batch finishes. There are indeed N promises in flight, but not N requests. I'll try to illustrate with Promise\each_limit($promises, 3, $success, $fail) (with [==] representing request duration):

Pipe 1: [===]            $success() [==============] $success() [==...
Pipe 2: [=====]          $success() [======]         $success() [==...
Pipe 3: [==============] $success() [========]       $success() [==...

What I expected to get is:

Pipe 1: [===] $success() [==============] $success() [==...
Pipe 2: [=====] $success() [======] $success() [==...
Pipe 3: [==============] $success() [========] $success() [==...
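
The difference between the two diagrams can be made concrete with a small, dependency-free simulation. This is only an illustrative sketch: the durations are read off the diagram above in arbitrary time units, and `batchedMakespan` and `pipelinedMakespan` are hypothetical helper names, not part of Guzzle or the SDK.

```php
<?php
/** Total time when each batch must fully finish before the next one starts. */
function batchedMakespan(array $durations, int $limit): int
{
    $total = 0;
    foreach (array_chunk($durations, $limit) as $batch) {
        $total += max($batch); // the slowest request holds up the whole batch
    }
    return $total;
}

/** Total time when a freed slot immediately picks up the next request. */
function pipelinedMakespan(array $durations, int $limit): int
{
    $slots = array_fill(0, $limit, 0); // time at which each slot becomes free
    foreach ($durations as $duration) {
        $next = array_search(min($slots), $slots, true); // earliest free slot
        $slots[$next] += $duration;
    }
    return max($slots);
}

$durations = [3, 5, 14, 14, 6, 8];
echo batchedMakespan($durations, 3), "\n";   // 28
echo pipelinedMakespan($durations, 3), "\n"; // 19
```

With the same six requests and a limit of 3, the batched schedule waits for the 14-unit request twice, while the pipelined one overlaps the short requests with it.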

mtdowling commented 9 years ago

Oh, interesting. Can you show the code you used to determine inflight requests vs inflight promises?

Briareos commented 9 years ago

Unfortunately I don't have a clear way of isolating the problem. I did notice, however, different behavior in Guzzle 5 and 6. Guzzle 5 behaves as described above, and Guzzle 6 works as expected (it doesn't block). Still, in both it's kinda random how many parallel tasks are processed at the same time; i.e. if I ask for 2 parallel tasks, it will sometimes process 2 and other times 3 (it's always +1).

What I did to discover the problem is run my script, which is pretty much your example no.1:

git clone https://github.com/Briareos/aws-test.git
cd aws-test
composer install
php test.php

And in another terminal to observe the downloading files:

watch -n.1 'ls -l aws-test/out'

It will download files in parallel, but differently in Guzzle 5 and Guzzle 6. I will investigate it more next week.
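
Besides watching the output directory, one way to check how many promises are outstanding at once is to count them in code. The class below is a minimal, library-free sketch (`InFlightTracker` is a hypothetical name): it counts an item when the wrapped generator produces it and expects the caller to report when that item has settled. It measures promises handed out, not network transfers.

```php
<?php
/** Counts how many items are currently outstanding and remembers the peak. */
final class InFlightTracker
{
    private int $current = 0;
    private int $peak = 0;

    public function started(): void
    {
        $this->current++;
        $this->peak = max($this->peak, $this->current);
    }

    public function finished(): void
    {
        $this->current--;
    }

    public function peak(): int
    {
        return $this->peak;
    }

    /** Wrap any iterable so each item is counted the moment it is produced. */
    public function watch(iterable $items): \Generator
    {
        foreach ($items as $key => $item) {
            $this->started();
            yield $key => $item;
        }
    }
}

// Stand-in demo: each item "finishes" before the next one is pulled.
$tracker = new InFlightTracker();
foreach ($tracker->watch(['a', 'b', 'c']) as $item) {
    $tracker->finished();
}
echo $tracker->peak(), "\n"; // 1
```

In the concurrent example, the idea would be to pass `$tracker->watch($promises)` to `Promise\each_limit` and call `$tracker->finished()` from both the fulfilled and rejected callbacks; a peak above the requested limit would then show the "+1" behavior directly.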

adamlc commented 9 years ago

@mtdowling just to let you know that WyriHaximus/react-guzzle-psr7 works perfectly with the SDK! Thanks @WyriHaximus :dancer:

Briareos commented 9 years ago

Hey @mtdowling, just wanted to let you know that https://github.com/guzzle/promises/pull/5 fixes the issue with the number of tasks processed in parallel that I mentioned in the post above.

The only issue that persists is in Guzzle 5, which still behaves as I described in https://github.com/aws/aws-sdk-php/issues/621#issuecomment-111430093. I won't be investigating it any further because I've decided to migrate to Guzzle 6.

mtdowling commented 9 years ago

Glad to hear that fixed it. Guzzle 5 is architected quite differently and would probably require non-trivial changes to make it work identically to v6. I don't think there's a big need to update v5 to match v6 here considering v6 is available. We'll tag the related libraries and make sure they get pulled into the next release.

@adamlc awesome! We'll try to do more testing and make sure everything works as expected. As soon as we are sure, we'll start promoting that integration point.