fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application
https://selfoss.aditu.de
GNU General Public License v3.0

Fetch sources in parallel #1249

Open RitzyMage opened 3 years ago

RitzyMage commented 3 years ago

Since I have so many sources that selfoss is fetching, refreshing all of them can take quite a while. I'm used to JavaScript, where you can fetch in parallel using Promise.all or callback syntax. Although it's a little more difficult in PHP, it seems possible (https://stackoverflow.com/questions/9308779/php-parallel-curl-requests for example).

I'm willing to work on this issue, but it would be a lot of work and I wanted to make sure that others in the community thought this was a good idea first. Is there any reason why this would be a bad idea?

jtojnar commented 3 years ago

This is on my to-do list, but I have not gotten around to it yet, so any help would be welcome.

We do not use raw curl but Guzzle, which does support promises.

So on the selfoss side, it would be just a matter of making the spouts’ load method return a Promise, then making ContentLoader’s fetch method return it as well, and having the update method call Promise->all (or better, run $n promises at a time using Each::ofLimit).

Unfortunately, we do not use Guzzle directly for all spouts: RSS goes through SimplePie, which does not support asynchronous fetching, and adding that would be a big change, so we probably need to fork it. I went ahead and created a fork for selfoss since I want to integrate SimplePieFileGuzzle anyway.
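
For the spouts that do go through Guzzle, the "run $n promises at a time" idea could look roughly like the sketch below. It assumes guzzlehttp/guzzle with guzzlehttp/promises 1.4 or newer is installed via Composer; `fetchAllLimited` and `$fetchAsync` are hypothetical names for illustration, not existing selfoss code.

```php
<?php
// Assumption: guzzlehttp/guzzle (with guzzlehttp/promises >= 1.4) is installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Each;

/**
 * Hypothetical helper, not selfoss code: resolve one promise per URL with at
 * most $concurrency of them pending at a time. Returns [url => value], with
 * null for URLs whose promise was rejected.
 *
 * $fetchAsync is any callable that takes a URL and returns a Guzzle promise.
 */
function fetchAllLimited(callable $fetchAsync, array $urls, int $concurrency): array
{
    $results = [];

    // The generator creates each promise lazily, so Each::ofLimit only starts
    // a new request once a slot frees up.
    $pending = (function () use ($fetchAsync, $urls) {
        foreach ($urls as $url) {
            yield $url => $fetchAsync($url);
        }
    })();

    Each::ofLimit(
        $pending,
        $concurrency,
        function ($value, $url) use (&$results) {
            $results[$url] = $value;
        },
        function ($reason, $url) use (&$results) {
            $results[$url] = null; // one failed source should not abort the rest
        }
    )->wait();

    return $results;
}

// Possible $fetchAsync built on a real client: resolves to the response body.
$client = new Client(['timeout' => 10]);
$fetchBody = function ($url) use ($client) {
    return $client->getAsync($url)->then(function ($response) {
        return (string) $response->getBody();
    });
};
// Example call with placeholder URLs:
// $bodies = fetchAllLimited($fetchBody, ['https://example.com/a.xml', 'https://example.com/b.xml'], 5);
```

This only helps where the underlying request is non-blocking; anything that goes through SimplePie would still block as described above.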

RitzyMage commented 3 years ago

Yeah I was just looking through the code and discovered the SimplePie thing. I got the project up and running and I'll start working on this. I'm not sure what to do with the RSS spout though.

jtojnar commented 3 years ago

Well, I think the first step would be promisification of the code base using https://github.com/guzzle/promises#promise (in addition to the stuff mentioned above, at least ContentLoader::{fetchIcon,fetchThumbnail} and spouts\rss\fulltextrss::getContent also need to be promisified). That should be reasonably simple, even if the blocking nature of SimplePie and Graby would not allow true parallelization.

Then we could focus on adding async support to Graby and SimplePie. The former should be approachable, renaming Graby::fetchContent and making it and its dependencies asynchronous (should be fine since Graby uses HTTPlug library which also supports promises).

SimplePie would be the hardest. I would suggest porting it to HTTPlug first (this is how it was done for Graby) and then adding async support later.

Even the first step would be a minor improvement since it would at least allow parallelization of fetching icons and thumbnails.

jtojnar commented 3 years ago

It looks like it might be possible to parallelize everything using threads/child processes:

Or maybe just use queues and allow running multiple update scripts at a time.

RitzyMage commented 3 years ago

That last stackoverflow question seems like it would work as well as be really easy, so I'm going to try that first.

RitzyMage commented 3 years ago

After further inspection, I realized that any modern version of AMP requires a higher version of PHP; I'll look for something else unless we can update the PHP version for the project (which we probably can't).

jtojnar commented 3 years ago

selfoss aims to support Debian oldstable so the next version will support PHP 7.0+, which should work with https://packagist.org/packages/amphp/parallel#v1.2.0

RitzyMage commented 3 years ago

Trying (and technically succeeding) to integrate with ReactPHP led me to realize that although I could use it to wrap the items in promises, they still waited to resolve one after another because they use blocking calls (https://github.com/reactphp/reactphp/wiki/FAQ). Although we might want to look into ReactPHP, just wrapping the loop with promises won't work.

Which version are you referring to for the 'next version'? 2.19? If so, should we update the composer.json?

jtojnar commented 3 years ago

If we want to use ReactPHP, we could use the aforementioned php parallel extension through https://github.com/reactphp-parallel/reactphp-parallel which should allow parallelization of blocking calls AIUI. AMP also seems to allow parallelization by running blocking code in separate threads or processes.

Neither of these will work on shared hostings but we can just fall back to serial execution there.
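
A rough sketch of that fallback, using only built-in PHP functions. `detectUpdateMode` and `updateSerially` are hypothetical names, and the threaded and forking branches themselves are left out; the sketch only shows detecting what the host offers and the serial path.

```php
<?php
/**
 * Hypothetical helper, not selfoss code: report which execution mode this
 * PHP installation could support.
 */
function detectUpdateMode(): string
{
    if (extension_loaded('parallel')) {
        return 'threads';   // the parallel extension needs a thread-safe (ZTS) PHP build
    }
    if (extension_loaded('pcntl') && extension_loaded('posix')) {
        return 'processes'; // forking child processes is possible
    }
    return 'serial';        // typical shared hosting
}

/**
 * Serial fallback: fetch the sources one after another, keeping their keys.
 */
function updateSerially(array $sources, callable $fetchOne): array
{
    $results = [];
    foreach ($sources as $id => $source) {
        $results[$id] = $fetchOne($source);
    }
    return $results;
}

$mode = detectUpdateMode();
echo "update mode: {$mode}\n";
```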

> Which version are you referring to for the 'next version'? 2.19? If so, should we update the composer.json?

I mean the version after 2.19 (2.19 will hopefully be released this month.)

RitzyMage commented 3 years ago

Besides requiring PHP 7, AMP parallel seems pretty hard to use (I just tried, but had trouble getting anything to work after a couple of hours). I don't think we need that level of parallelization anyway; we're mostly worried about parallelizing I/O. I'll probably try some more stuff (maybe https://github.com/spatie/async and/or the guzzle promises in fulltextrss) tomorrow.

RitzyMage commented 3 years ago

So spatie/async apparently requires the pcntl and posix extensions, and getting those installed is non-trivial; I don't think we can expect users to have those.

RitzyMage commented 3 years ago

And although I got spatie/async working, it can't serialize the database connection, so it won't be as simple as wrapping it in a for loop

jtojnar commented 3 years ago

Good point. That will probably be an issue for any parallelization, so we will either have to go with the promisification, or handle the db manipulation in the main thread. The latter sounds messy code organization-wise, unless we decouple the parts using an (in-memory or external) queue.
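
To make the queue idea a bit more concrete, here is a minimal in-memory sketch using only built-in PHP. `fetchItems` and `drainQueue` are hypothetical stand-ins, and the "worker" loop runs inline here; in a real setup it would run in child processes or threads, with only the serialized items crossing the boundary and never the database connection.

```php
<?php
/**
 * Stand-in for a spout's fetch step (hypothetical). It returns plain arrays
 * only, so the result can be serialized and handed across a process boundary.
 */
function fetchItems(string $source): array
{
    return [
        ['source' => $source, 'title' => "First item of {$source}"],
        ['source' => $source, 'title' => "Second item of {$source}"],
    ];
}

/**
 * Main-thread side: drain the queue and pass each item to $store, the only
 * code that touches the database connection. Returns the number of items.
 */
function drainQueue(SplQueue $queue, callable $store): int
{
    $written = 0;
    while (!$queue->isEmpty()) {
        $store(unserialize($queue->dequeue()));
        $written++;
    }
    return $written;
}

$queue = new SplQueue();

// Worker side (inline in this sketch): only serialized data enters the queue.
foreach (['feed-a', 'feed-b'] as $source) {
    foreach (fetchItems($source) as $item) {
        $queue->enqueue(serialize($item));
    }
}

$rows = [];
$count = drainQueue($queue, function (array $item) use (&$rows) {
    $rows[] = $item; // a real $store would run an INSERT here
});
echo "stored {$count} items\n";
```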

jtojnar commented 3 years ago

This is interesting: https://wiki.php.net/rfc/fibers

Though it won't be available for a while yet.