RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Massive Slowdown when generating large feed #1857

Closed: Church- closed this issue 4 years ago

Church- commented 4 years ago

Describe the bug

I've modified the AO3 bridge to fetch the full text of each chapter of a work, so that I can do all my reading in my RSS reader. A pared-down repro script using simpleHTMLDom takes around 20s to generate a list of chapters, but my modification, which does effectively the same thing, can take 3-5 min to fetch the text and generate the feed.

Are there any obvious ways to speed this up?

        // Feed for recent chapters of a specific work.
        private function collectWork($id) {
                $url = self::URI . "/works/$id/navigate";
                $html = getSimpleHTMLDOM($url)
                        or returnServerError('could not request AO3');
                $html = defaultLinkTo($html, self::URI);

                $this->title = $html->find('h2 a', 0)->plaintext;

                // Iterate over the chapter entries (list items) of the index.
                foreach($html->find('ol.index.group > li') as $element) {
                        $item = array();

                        $item['title'] = $element->find('a', 0)->plaintext;
                        $item['uri'] = $element->find('a', 0)->href;

                        // One extra HTTP request per chapter to get the full text.
                        $pageHtml = getSimpleHTMLDOM($item['uri'])
                                or returnServerError('could not request AO3');

                        $item['content'] = $pageHtml->find('div[class="userstuff module"]', 0)->innertext;

                        // The date is wrapped in parentheses, e.g. "(2020-01-01)".
                        $strdate = $element->find('span.datetime', 0)->plaintext;
                        $strdate = str_replace(array('(', ')'), '', $strdate);
                        $item['timestamp'] = strtotime($strdate);

                        $item['uid'] = $item['uri'] . "/$strdate";

                        $this->items[] = $item;
                }

                // Newest chapter first.
                $this->items = array_reverse($this->items);
        }

em92 commented 4 years ago

Hi, @Church-! Use getSimpleHTMLDomCached.
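
To illustrate the idea behind a cached fetch, here is a minimal, self-contained sketch. It is not RSS-Bridge's implementation: `fetchCached`, the `$fetch` callable, the cache directory and the example.org URL are all hypothetical names made up for this example. Inside the bridge, the corresponding change would presumably be to call `getSimpleHTMLDOMCached($item['uri'])` instead of `getSimpleHTMLDOM($item['uri'])` for the per-chapter request.

```php
<?php
// Hypothetical helper, not part of RSS-Bridge: stores each fetched page on
// disk so that later feed refreshes do not download the same chapter again.
function fetchCached(string $url, callable $fetch, string $cacheDir, int $ttl = 86400): string
{
    $file = $cacheDir . '/' . sha1($url) . '.html';

    // Reuse the stored copy while it is younger than $ttl seconds.
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file);
    }

    $body = $fetch($url);            // the only slow step (network request)
    file_put_contents($file, $body);
    return $body;
}

// Stand-in fetcher that counts how often it is actually called.
$calls = 0;
$fakeFetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<div class=\"userstuff module\">chapter text for $url</div>";
};

$dir = sys_get_temp_dir() . '/ao3-cache-' . uniqid();
mkdir($dir);

$chapterUrl = 'https://example.org/works/1/chapters/1'; // placeholder URL
$first  = fetchCached($chapterUrl, $fakeFetch, $dir);
$second = fetchCached($chapterUrl, $fakeFetch, $dir);
// The second call is served from the cache file, so $fakeFetch ran only once.
```

A cache like this only pays off on later refreshes, when already-seen chapters are read from disk. The very first request for a work still has to download every chapter.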

Church- commented 4 years ago

@em92 Hey there. :smile:

So how would getSimpleHTMLDomCached help, exactly? I thought it just cached the result of a page for reuse when scraping again later.

My problem is that I can't even generate the initial feed without timing out with a 504.

Even if I raise the execution time with ini_set('max_execution_time', '3000'); in the bridge, the request is still liable to time out, or my RSS reader times out while trying to fetch the feed.

Church- commented 4 years ago

Closing this out; I found a workaround using the feediron plugin in tt-rss.

Using this config in feediron:

"archiveofourown.org": {
    "type": "xpath",
    "xpath": [
        "div[@class='userstuff module']"
    ]
}

Works wonderfully when scraping chapters of works.
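
For anyone who wants to check what that XPath expression selects outside tt-rss, here is a small sketch using PHP's built-in DOM extension. The sample markup is made up for illustration and only mimics the relevant part of a chapter page. The leading `//` is an assumption that the expression is matched anywhere in the document.

```php
<?php
// Sample markup standing in for a chapter page (invented for this example).
$html = <<<HTML
<html><body>
  <div id="header">site navigation</div>
  <div class="userstuff module"><p>Chapter text.</p></div>
  <div class="userstuff">author notes</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Assumption: the expression is evaluated anywhere in the document ("//").
$nodes = $xpath->query("//div[@class='userstuff module']");

// Take the text of the first match, or an empty string if nothing matched.
$content = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
echo $content, PHP_EOL;
```

Note that `@class='userstuff module'` is an exact string comparison on the attribute, so a div with only `class="userstuff"`, or with the same classes in a different order, is not matched.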