Endless waiting for response causes process to run without ending

holtkamp commented 6 years ago

In "some" cases I noticed that the process of fetching URLs from a sitemap halted "indefinitely", effectively keep running the PHP process forever, filling up the terminal with status updates. In the following example I killed the process after 113054 seconds (31 hours), while the last change occured at second 159: a few minutes.

Probably the request became "lost" somewhere in transit, it should time-out?


  22.0MB 158.46 sec. Queued/running/done: 8/5/3447. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3312 302: 135           
  22.0MB 158.72 sec. Queued/running/done: 6/5/3449. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3314 302: 135           
  22.0MB 158.97 sec. Queued/running/done: 4/5/3451. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3316 302: 135           
  22.0MB 159.22 sec. Queued/running/done: 2/5/3453. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3318 302: 135           
  22.0MB 159.47 sec. Queued/running/done: 0/4/3456. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3321 302: 135           
  22.0MB 159.72 sec. Queued/running/done: 0/2/3458. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3323 302: 135           
  22.0MB 159.97 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135   
...
...
  22.0MB 113053.19 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113053.44 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113053.69 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113053.94 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113054.19 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113054.44 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113054.69 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135           
  22.0MB 113054.94 sec. Queued/running/done: 0/1/3459. Statistics: failed: 0 CF-Cache-Status (HIT): 3189 200: 3324 302: 135

dpovshed commented 6 years ago

Great ticket! Actually, cases like this are the only major problems which stop me from recommend using React in a huge production app.

I also noticed things like this, but they appeared in my case when I asked for some quite critical parameters, like 50 or so requests in parallel. Some requests just never return with any response.

This makes me think that probably deep in the core of React we may have some constants or limitations. Or these limitations somewhere in Unix - like number of simultaneously operated core elements.

In any case, if we track how long request is running, we could be able to mark a request as 'lost' and respawn a new one. This reminds me a difference between TCP and UDP protocols :)

holtkamp commented 6 years ago

@WyriHaximus do you have any suggestion on how to prevent the described "endless" timeouts?

WyriHaximus commented 6 years ago

Hey neat project you got there @dpovshed, just took a quick peek at composer.json and it seems you're using our old monolithing repository. These days we've split everything up over different packages per component: https://reactphp.org/#core-components

We've also updated all our components (a lot) since 0.4, and even created new ones. Like promise-timer which is useful in this case combined with race() which lets you race two promises. So assuming you're using @clue's clue/buzz-react (PSR-7 compatible http client build on react/http-client with a promise based API) you can do:

race([
    Timer\reject(2.0, $loop),
    (new Browser($loop))->get($url),
])->then(function () {
    /* A successful request ends up here */
}, function () {
    /* Any errors end up here as well as when the timer rejects */
});

holtkamp commented 6 years ago

Great pointers @WyriHaximus ! Thanks! Will try them out locally.

dpovshed commented 6 years ago

Hey guys, thank you @WyriHaximus for pointing this out! Can restrain myself from silly joke, but those promises looks promising :))

@holtkamp, thank you for heading up on this!

With old core code you almost always can reproduce the problem running a lot of parallel workers, so should be easy to test a new core!

holtkamp commented 6 years ago

@WyriHaximus thanks for the hints, I updated to the separate components https://github.com/dpovshed/octopus/pull/11 and then adopted the Browser approach for spawning https://github.com/dpovshed/octopus/pull/12

Seems to work quite nice! Will need some extra testing to see whether all functionality still works, but I got confidence this will work!

WyriHaximus commented 6 years ago

@holtkamp @dpovshed glad to hear you like it, let me know in case you run into something :+1:

dpovshed commented 6 years ago

I ran tests twice with intentionally high number of concurrency, so there was a lot of timeouts, server errors, and connection refused cases.

Processed ~21,000 URLs with concurrency 75 , and ~900 URL with concurrency 120.

Both cases has been perfectly handled! This is great, really!!

holtkamp commented 6 years ago

@dpovshed two things that seem to need some attention:

the Timeout promise is 'always' executed, after crawling all URLs it still takes some time before the summary is displayed, see the difference in crawl duration below.
it seems the new spanWithBrowser method does not properly count redirects. For a sitemap with 30 URLs that issue a 302 redirect code:

spanWithBrowser()

+-----------------------------+----------------------------+
| Crawling summary for: sitemap-with-302-redirects.xml     |
+-----------------------------+----------------------------+
| failed                      | 200                        |
+-----------------------------+----------------------------+
| 0                           | 30                         |
+-----------------------------+----------------------------+
| Crawling started: 2017-11-15T10:31:25+00:00              |
| Crawling ended: 2017-11-15T10:32:06+00:00                |
| Crawling duration: 41 seconds                            |
| Applied concurrency: 5                                   |
| Total amount of processed data: 0                        |
| Failed to load #URLs: 0                                  |
+-----------------------------+----------------------------+

spanWithClient()

+--------------------+--------------------+-------------------+
| Crawling summary for: sitemap-with-302-redirects.xml        |
+--------------------+--------------------+-------------------+
| failed             | 302                | 200               |
+--------------------+--------------------+-------------------+
| 0                  | 30                 | 30                |
+--------------------+--------------------+-------------------+
| Crawling started: 2017-11-15T10:27:57+00:00                 |
| Crawling ended: 2017-11-15T10:28:02+00:00                   |
| Crawling duration: 5 seconds                                |
| Applied concurrency: 5                                      |
| Total amount of processed data: 0                           |
| Failed to load #URLs: 0                                     |
+--------------------+--------------------+-------------------+

holtkamp commented 6 years ago

As described in https://github.com/clue/php-buzz-react, the Browser-based approach automatically follows redirects by default, which is nice. However, this prevents us from collecting HTTP Redirection response codes.

The default behaviour of following redirects can be disabled by configuring the Browser:

$this->browser = new Browser($this->loop);
$this->browser = $this->browser->withOptions([
       'followRedirects' => false,
]);

However, when using the Browser and not following the redirect location, it seems the newly added URL is never spawn... Not sure what causes this though

clue commented 6 years ago

@holtkamp Happy to see the progress here! Not sure I follow what problem you're seeing. If you're seeing a problem with clue/buzz-react, please report upstream and I'm happy to look into this! :+1:

holtkamp commented 6 years ago

If you're seeing a problem with clue/buzz-react, please report upstream and I'm happy to look into this!

I do not think there is a problem with https://github.com/clue/php-buzz-react when trying to manually "follow" a redirect. I think that from "within" a promise it is not possible to influence the outside world / the target manager?

This is done here: https://github.com/dpovshed/octopus/blob/4460a8371a9b898c9ecc7585a28ce945fcb97008/src/Processor.php#L212-L218

dpovshed / octopus

Endless waiting for response causes process to run without ending #10