KeithCu / LinuxReport

Customizable Linux news site based on Python / Flask
https://covidreport.keithcu.com/
GNU Lesser General Public License v3.0

Faster when multiple feeds have expired #11

Closed. KeithCu closed this issue 4 years ago.

KeithCu commented 4 years ago

On startup, or when multiple feeds have expired, it fetches the RSS feeds sequentially. That's slow when there are 9 or more to fetch and some sites take 1.5 seconds to respond. Note that most users won't hit this, because feeds are refreshed only about once an hour, and a little jitter spreads out the requests, so it's unusual for more than a few fetches to be needed at once.

It would be faster to switch to multiprocessing or multithreading so that multiple fetches can happen at the same time.

Multiprocessing would be simplest, but each of the ~10 Python engines that respond to Apache requests would then probably keep its own pool of 2-5 worker processes sitting around.

It could also be sped up by creating multiple threads, which works well for this job since the fetches are I/O-bound.

Ideally it would be done asynchronously: a single thread could queue up 2-9 requests and spend most of its time just waiting 0.5 to 1.5 seconds for each response.

I think creating a pool of Python threads is the best solution here: threads are much lighter weight than processes, and the logic is very simple.

Because the cache lives on the file system, either a process-based or a thread-based solution should work.

KeithCu commented 4 years ago

Here's a library that could be helpful: https://trio.readthedocs.io/en/stable/
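
For illustration, here's a minimal sketch of the fetches with Trio, assuming feedparser does the actual parsing. Since feedparser.parse() blocks, this version pushes it onto Trio's worker-thread pool rather than doing true async I/O:

```python
import feedparser
import trio

async def fetch_feed(url, results):
    # feedparser.parse() is blocking, so run it on Trio's worker-thread pool
    results[url] = await trio.to_thread.run_sync(feedparser.parse, url)

async def fetch_all(urls):
    results = {}
    async with trio.open_nursery() as nursery:
        for url in urls:
            nursery.start_soon(fetch_feed, url, results)
    # the nursery waits here until every fetch has finished
    return results

feeds = trio.run(fetch_all, ["http://lwn.net/headlines/newrss",
                             "http://www.osnews.com/feed/"])
```

A fully async version would instead need an async HTTP client to download the raw XML and then hand it to feedparser.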

KeithCu commented 4 years ago

This might be easy enough: https://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor
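
Adapting the ThreadPoolExecutor example from those docs to feed fetching (feedparser assumed) would look roughly like:

```python
import concurrent.futures
import feedparser

URLS = ["http://lwn.net/headlines/newrss",
        "http://www.osnews.com/feed/"]

# a short-lived pool: threads are created for one batch, then torn down
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(feedparser.parse, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        feed = future.result()
        print(f"{url}: {len(feed.entries)} entries")
```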

KeithCu commented 4 years ago

Another way to implement this would be to split feed fetching off into a separate service that mostly sleeps, periodically checks whether any of the URLs are out of date, and fetches and updates them.
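
A rough sketch of such a fetcher loop, with the cache directory, filename scheme, and timing values all placeholders:

```python
import os
import time
import pickle
import feedparser

CACHE_DIR = "/run/linuxreport"  # placeholder; the real cache location may differ
TTL = 3600                      # consider a feed stale after an hour
POLL = 60                       # how often to look for stale feeds

def cache_path(url):
    # crude filename scheme, for illustration only
    return os.path.join(CACHE_DIR, url.replace("/", "_"))

def fetcher_loop(urls):
    os.makedirs(CACHE_DIR, exist_ok=True)
    while True:
        now = time.time()
        for url in urls:
            path = cache_path(url)
            # refetch if the feed has never been cached or has gone stale
            if not os.path.exists(path) or now - os.path.getmtime(path) > TTL:
                with open(path, "wb") as f:
                    pickle.dump(feedparser.parse(url), f)
        time.sleep(POLL)
```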

One challenge is that the feeds are currently saved to a RAM drive in /tmp, which under systemd is set up as a private tmp (PrivateTmp=). That means another service wouldn't be able to access it. I would have to move the cache to a directory that can be shared, and ideally one that is still backed by RAM.

Update: systemd lets two services join namespaces via JoinsNamespaceOf=. That would let them share the private tmp directory.
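
For example, the fetcher's unit file could declare something like this (unit names and paths are hypothetical):

```ini
# /etc/systemd/system/linuxreport-fetcher.service (hypothetical names)
[Unit]
Description=LinuxReport feed fetcher
# join the private namespace of the web app's service;
# both units need PrivateTmp=yes for the shared /tmp to work
JoinsNamespaceOf=linuxreport.service

[Service]
ExecStart=/usr/bin/python3 /srv/linuxreport/fetcher.py
PrivateTmp=yes
```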

KeithCu commented 4 years ago

I've written an implementation using concurrent.futures and a global thread pool. The worst-case page time should now be the time of the slowest feed instead of the time to fetch all needed feeds sequentially.
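
The actual code isn't shown here, but a global-pool version would look something like this sketch:

```python
import concurrent.futures
import feedparser

# one pool shared by the whole process, so threads aren't recreated per request
FETCH_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=10)

def fetch_feeds(urls):
    futures = {FETCH_POOL.submit(feedparser.parse, url): url for url in urls}
    # total wall time is now bounded by the slowest single feed
    return {futures[f]: f.result()
            for f in concurrent.futures.as_completed(futures)}
```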

KeithCu commented 4 years ago

Here's a test run:

Serially it would take about 23 seconds, but with a temporary 10-worker thread pool it takes 4.


Parsing from remote site http://feeds.feedburner.com/linuxtoday/linux in 0.442090.
Parsing from remote site http://lwn.net/headlines/newrss in 1.406546.
Parsing from remote site http://rss.slashdot.org/Slashdot/slashdotMain in 1.419237.
Parsing from remote site http://www.osnews.com/feed/ in 1.402242.
Parsing from remote site http://news.ycombinator.com/rss in 1.500885.
Parsing from remote site http://lxer.com/module/newswire/headlines.rss in 1.624353.
Parsing from remote site https://www.reddit.com/r/Coronavirus/rising/.rss in 1.667059.
Parsing from remote site http://www.reddit.com/r/linux/.rss in 1.744879.
Parsing from remote site https://www.google.com/alerts/feeds/12151242449143161443/16985802477674969984 in 1.117016.
Parsing from remote site http://www.geekwire.com/feed/ in 3.537120.
Parsing from remote site http://planet.debian.org/rss20.xml in 3.734472.
Parsing from remote site http://www.independent.co.uk/topic/coronavirus/rss in 3.404218.

Fetched all feeds in 4.016407 sec.