Open holtkamp opened 6 years ago
Great ticket! Actually, cases like this are the only major problems which stop me from recommend using React in a huge production app.
I also noticed things like this, but they appeared in my case when I asked for some quite critical parameters, like 50 or so requests in parallel. Some requests just never return with any response.
This makes me think that probably deep in the core of React we may have some constants or limitations. Or these limitations somewhere in Unix - like number of simultaneously operated core elements.
In any case, if we track how long request is running, we could be able to mark a request as 'lost' and respawn a new one. This reminds me a difference between TCP and UDP protocols :)
@WyriHaximus do you have any suggestion on how to prevent the described "endless" timeouts?
Hey neat project you got there @dpovshed, just took a quick peek at composer.json
and it seems you're using our old monolithing repository. These days we've split everything up over different packages per component: https://reactphp.org/#core-components
We've also updated all our components (a lot) since 0.4, and even created new ones. Like promise-timer
which is useful in this case combined with race()
which lets you race two promises. So assuming you're using @clue's clue/buzz-react
(PSR-7 compatible http client build on react/http-client
with a promise based API) you can do:
race([
Timer\reject(2.0, $loop),
(new Browser($loop))->get($url),
])->then(function () {
/* A successful request ends up here */
}, function () {
/* Any errors end up here as well as when the timer rejects */
});
Great pointers @WyriHaximus ! Thanks! Will try them out locally.
Hey guys, thank you @WyriHaximus for pointing this out! Can restrain myself from silly joke, but those promises looks promising :))
@holtkamp, thank you for heading up on this!
With old core code you almost always can reproduce the problem running a lot of parallel workers, so should be easy to test a new core!
@WyriHaximus thanks for the hints, I updated to the separate components https://github.com/dpovshed/octopus/pull/11 and then adopted the Browser approach for spawning https://github.com/dpovshed/octopus/pull/12
Seems to work quite nice! Will need some extra testing to see whether all functionality still works, but I got confidence this will work!
@holtkamp @dpovshed glad to hear you like it, let me know in case you run into something :+1:
I ran tests twice with intentionally high number of concurrency, so there was a lot of timeouts, server errors, and connection refused cases.
Processed ~21,000 URLs with concurrency 75 , and ~900 URL with concurrency 120.
Both cases has been perfectly handled! This is great, really!!
@dpovshed two things that seem to need some attention:
spanWithBrowser
method does not properly count redirects. For a sitemap with 30 URLs that issue a 302
redirect code:spanWithBrowser()
+-----------------------------+----------------------------+
| Crawling summary for: sitemap-with-302-redirects.xml |
+-----------------------------+----------------------------+
| failed | 200 |
+-----------------------------+----------------------------+
| 0 | 30 |
+-----------------------------+----------------------------+
| Crawling started: 2017-11-15T10:31:25+00:00 |
| Crawling ended: 2017-11-15T10:32:06+00:00 |
| Crawling duration: 41 seconds |
| Applied concurrency: 5 |
| Total amount of processed data: 0 |
| Failed to load #URLs: 0 |
+-----------------------------+----------------------------+
spanWithClient()
+--------------------+--------------------+-------------------+
| Crawling summary for: sitemap-with-302-redirects.xml |
+--------------------+--------------------+-------------------+
| failed | 302 | 200 |
+--------------------+--------------------+-------------------+
| 0 | 30 | 30 |
+--------------------+--------------------+-------------------+
| Crawling started: 2017-11-15T10:27:57+00:00 |
| Crawling ended: 2017-11-15T10:28:02+00:00 |
| Crawling duration: 5 seconds |
| Applied concurrency: 5 |
| Total amount of processed data: 0 |
| Failed to load #URLs: 0 |
+--------------------+--------------------+-------------------+
As described in https://github.com/clue/php-buzz-react, the Browser-based approach automatically follows redirects by default, which is nice. However, this prevents us from collecting HTTP Redirection response codes.
The default behaviour of following redirects can be disabled by configuring the Browser:
$this->browser = new Browser($this->loop);
$this->browser = $this->browser->withOptions([
'followRedirects' => false,
]);
However, when using the Browser and not following the redirect location, it seems the newly added URL is never spawn... Not sure what causes this though
@holtkamp Happy to see the progress here! Not sure I follow what problem you're seeing. If you're seeing a problem with clue/buzz-react, please report upstream and I'm happy to look into this! :+1:
If you're seeing a problem with clue/buzz-react, please report upstream and I'm happy to look into this!
I do not think there is a problem with https://github.com/clue/php-buzz-react when trying to manually "follow" a redirect. I think that from "within" a promise it is not possible to influence the outside world / the target manager?
This is done here: https://github.com/dpovshed/octopus/blob/4460a8371a9b898c9ecc7585a28ce945fcb97008/src/Processor.php#L212-L218
In "some" cases I noticed that the process of fetching URLs from a sitemap halted "indefinitely", effectively keep running the PHP process forever, filling up the terminal with status updates. In the following example I killed the process after 113054 seconds (31 hours), while the last change occured at second 159: a few minutes.
Probably the request became "lost" somewhere in transit, it should time-out?