Open RitzyMage opened 3 years ago
This is on my to-do list but I have not gotten around to it yet, so any help would be welcome.
We do not use raw curl but Guzzle, which does support promises.
So on the selfoss side, it would just be a matter of making the spouts’ load method return a Promise, making ContentLoader’s fetch method return it as well, and having the update method call Promise->all (or better, run $n promises at a time using Each::ofLimit).
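To make the idea above concrete, here is a minimal sketch rather than actual selfoss code. It assumes guzzlehttp/guzzle is installed through Composer with guzzlehttp/promises 1.4 or newer (where the Each class is available). The functions loadSource and updateAll are hypothetical stand-ins for a spout's load method and the update method.

```php
<?php
// Sketch only: assumes guzzlehttp/guzzle (with guzzlehttp/promises >= 1.4)
// is installed via Composer. loadSource() and updateAll() are hypothetical
// names used for illustration; they are not part of selfoss.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Each;
use GuzzleHttp\Promise\PromiseInterface;
use Psr\Http\Message\ResponseInterface;

/**
 * Stand-in for a spout's load method: returns a promise that resolves
 * to the response body instead of blocking until the request is done.
 */
function loadSource(Client $client, string $url): PromiseInterface
{
    return $client->getAsync($url)->then(
        function (ResponseInterface $response): string {
            return (string) $response->getBody();
        }
    );
}

/**
 * Stand-in for the update method: keeps at most $limit requests in
 * flight and collects bodies and failures keyed by URL.
 */
function updateAll(Client $client, array $urls, int $limit): array
{
    $results = [];
    $errors = [];

    // The generator creates each promise lazily, so a request is only
    // started once Each::ofLimit has a free slot for it.
    $pending = (function () use ($client, $urls) {
        foreach ($urls as $url) {
            yield $url => loadSource($client, $url);
        }
    })();

    Each::ofLimit(
        $pending,
        $limit,
        function ($body, $url) use (&$results) {
            $results[$url] = $body;
        },
        function ($reason, $url) use (&$errors) {
            $errors[$url] = $reason;
        }
    )->wait();

    return [$results, $errors];
}
```

Calling `updateAll(new Client(), $urls, 5)` would then fetch five sources at a time. A failing source ends up in the second array instead of aborting the whole batch, which is one reason to prefer Each::ofLimit over waiting on all promises at once.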
Unfortunately, we do not use Guzzle directly for all spouts: RSS goes through SimplePie, which does not support asynchronous fetching, and adding that would be a big change, so we probably need to fork it. I went ahead and created a fork for selfoss since I want to integrate SimplePieFileGuzzle anyway.
Yeah I was just looking through the code and discovered the SimplePie thing. I got the project up and running and I'll start working on this. I'm not sure what to do with the RSS spout though.
Well, I think the first step would be promisification of the code base using https://github.com/guzzle/promises#promise (in addition to the stuff mentioned above, at least ContentLoader::{fetchIcon,fetchThumbnail} and spouts\rss\fulltextrss::getContent also need to be promisified). That should be reasonably simple, even if the blocking nature of SimplePie and Graby would not allow true parallelization.
Then we could focus on adding async support to Graby and SimplePie. The former should be approachable: renaming Graby::fetchContent and making it and its dependencies asynchronous (which should be fine since Graby uses the HTTPlug library, which also supports promises).
SimplePie would be the hardest. I would suggest porting it to HTTPlug first (this is how it was done for Graby) and then adding async support later.
Even the first step would be a minor improvement since it would at least allow parallelization of fetching icons and thumbnails.
It looks like it might be possible to parallelize everything using threads/child processes:
Or maybe just use queues and allow running multiple update scripts at a time.
That last stackoverflow question seems like it would work as well as be really easy, so I'm going to try that first.
After further inspection, I realized that any modern version of AMP requires a higher version of PHP; I'll look for something else unless we can update the PHP version for the project (which we probably can't).
selfoss aims to support Debian oldstable so the next version will support PHP 7.0+, which should work with https://packagist.org/packages/amphp/parallel#v1.2.0
Trying (and technically succeeding) to integrate with ReactPHP led me to realize that although I could use it to wrap the items in promises, they still resolved one after another because the code inside them uses blocking calls (https://github.com/reactphp/reactphp/wiki/FAQ). Although we might want to look into ReactPHP, just wrapping the loop with promises won't work.
Which version are you referring to for the 'next version'? 2.19? If so, should we update the composer.json?
If we want to use ReactPHP, we could use the aforementioned php parallel extension through https://github.com/reactphp-parallel/reactphp-parallel which should allow parallelization of blocking calls AIUI. AMP also seems to allow parallelization by running blocking code in separate threads or processes.
Neither of these will work on shared hosting, but we can just fall back to serial execution there.
Which version are you referring to for the 'next version'? 2.19? If so, should we update the composer.json?
I mean the version after 2.19 (2.19 will hopefully be released this month.)
Besides requiring PHP 7, AMP parallel seems pretty hard to use (I just tried, but had trouble getting anything to work after a couple of hours). I don't think we need that level of parallelization anyway; we're mostly worried about parallelizing I/O. I'll probably try some more stuff (maybe https://github.com/spatie/async and/or the guzzle promises in fulltextrss) tomorrow.
So spatie/async apparently requires the pcntl and posix extensions, and getting those installed is non-trivial; I don't think we can expect users to use those.
And although I got spatie/async working, it can't serialize the database connection, so it won't be as simple as wrapping it in a for loop.
Good point. That will probably be an issue for any parallelization, so we will either have to go with the promisification, or handle the db manipulation in the main thread. The latter sounds messy code organization-wise, unless we decouple the parts using an (in-memory or external) queue.
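As a rough illustration of the queue idea, the sketch below separates the part that fetches (and never touches the database) from the part that writes (and only runs in the main thread). It runs sequentially and only shows where the boundary would sit. The names collectItems and persistQueue, and the message layout, are hypothetical and not taken from selfoss.

```php
<?php
// Sketch only: all names here are hypothetical and not part of selfoss.

/**
 * Runs every fetcher and queues its result as a plain array. In a parallel
 * setup this is the part that could run in workers, because it never
 * touches the database connection.
 *
 * @param callable[] $fetchers map of source id => function returning a list of items
 */
function collectItems(array $fetchers): SplQueue
{
    $queue = new SplQueue();
    foreach ($fetchers as $sourceId => $fetch) {
        try {
            $queue->enqueue(['source' => $sourceId, 'items' => $fetch(), 'error' => null]);
        } catch (Throwable $e) {
            $queue->enqueue(['source' => $sourceId, 'items' => [], 'error' => $e->getMessage()]);
        }
    }

    return $queue;
}

/**
 * Drains the queue in the main thread, the only place that calls $store
 * (a stand-in for whatever writes to the database). Returns the number
 * of items handed to $store.
 */
function persistQueue(SplQueue $queue, callable $store): int
{
    $stored = 0;
    while (!$queue->isEmpty()) {
        $message = $queue->dequeue();
        if ($message['error'] !== null) {
            continue; // a real implementation would record the error for the source
        }
        foreach ($message['items'] as $item) {
            $store($message['source'], $item);
            $stored++;
        }
    }

    return $stored;
}
```

Because the queued messages are plain arrays, they can be serialized, unlike the database connection. The in-memory SplQueue could therefore be swapped for an external queue later without changing the writer side much.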
This is interesting: https://wiki.php.net/rfc/fibers
Though it won't be available for a while yet.
Since I have so many sources that selfoss is fetching, refreshing all of them can take quite a while. I'm used to JavaScript, where you can fetch in parallel using Promise.all or callback syntax. Although it's a little more difficult in PHP, it seems possible (https://stackoverflow.com/questions/9308779/php-parallel-curl-requests for example).
I'm willing to work on this issue, but it would be a lot of work and I wanted to make sure that others in the community thought this was a good idea first. Is there any reason why this would be a bad idea?