feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

Recursive multipage fetch error #201

Closed rookiejet closed 4 months ago

rookiejet commented 4 months ago

After #199, the recursive multipage option fails with the following recipe:

{
    "name": "arstechnica.com",
    "url": "arstechnica.com",
    "stamp": 1714614362,
    "author": "",
    "match": "arstechnica.com",
    "config": {
        "type": "xpath",
        "xpath": "div[contains(@class, 'article-content')]",
        "multipage": {
            "xpath": "nav[contains(@class, 'page-numbers')]\/span\/a[last()]",
            "append": true,
            "recursive": true
        },
        "modify": [
            {
                "type": "regex",
                "pattern": "\/<li.*? data-src=\"(.*?)\".*?>\\s*<figure.*?>.*?(?:<figcaption.*?<div class=\"caption\">(.*?)<\\\/div>.*?<\\\/figcaption>)?\\s*<\\\/figure>\\s*<\\\/li>\/s",
                "replace": "<figure><img src=\"$1\"\/><figcaption>$2<\/figcaption><\/figure>"
            }
        ],
        "cleanup": [
            "aside",
            "div[contains(@class, 'sidebar')]"
        ]
    }
}

Fetches for multipage articles using that recipe now fail with memory exhaustion:

[00:52:14/84489] guid 2,https://arstechnica.com/?p=2021143 (hash: {"ver":2,"uid":2,"hash":"SHA1:c9da0973a1
d72d22dc9e1d1c8a92524a07d0a847"} compat: SHA1:63f3a3eb0ea7ce00279023a4f8ff530c6350776f)
[00:52:14/84489] orig date: 1714601978 (2024-05-01 22:19:38)
[00:52:14/84489] title All the ways streaming services are aggravating their subscribers this week
[00:52:14/84489] link https://arstechnica.com/?p=2021143
[00:52:14/84489] language en
[00:52:14/84489] author Scharon Harding
[00:52:14/84489] looking for tags...
[00:52:14/84489] tags found: ads, comcast, nbcuniversal, netflix, peacock, sony, streaming, tech, warner bros. discovery
[00:52:14/84489] done collecting data.
[00:52:14/84489] looking for enclosures...
[00:52:14/84489] article hash: 0d2ba6fe8dff3f0d87a7174e8a00398543d691b1 [stored=]
[00:52:14/84489] hash differs, running HOOK_ARTICLE_FILTER handlers...
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/?p=2021143
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/2/
[00:52:14/84489] [UrlHelper] fetching: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/3/
PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 4096 bytes) in /srv/www/plugins.local/feediron/bin/fi_helper.php on line 67

I reverted the change to this line in commit https://github.com/feediron/ttrss_plugin-feediron/pull/199/commits/d0cd6a052cb78a96f40a366273ce5a3a765a6e95 and the above recipe is working again.

Article tested: https://arstechnica.com/gadgets/2024/05/all-the-ways-streaming-services-are-aggravating-their-subscribers-this-week/

dugite-code commented 4 months ago

Ooof, I did not catch that. I hope I'll be able to sit down this weekend to look at it. I'm fairly sure it's a matter of removing https://github.com/feediron/ttrss_plugin-feediron/blob/a586a2a96ceb4760d3c2e74d6968129f1b6caca8/init.php#L471 and iterating the maxpages counter.

@monofox any thoughts?

monofox commented 4 months ago

I can reproduce it. Looking deeper into that part of the links array is fairly interesting. I'll try to understand the algorithm tomorrow evening, because in this case the link is not reformatted, as that option should be disabled per the configuration. Just for your reference: replacing line https://github.com/feediron/ttrss_plugin-feediron/blob/a586a2a96ceb4760d3c2e74d6968129f1b6caca8/init.php#L457 and reverting to a non-referencing $lnk will solve it. But then the $links array has different values when reformat is executed. That's what I mean by understanding the correct algorithm.
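If I understand the by-reference issue correctly, the failure mode can be modeled outside PHP. The sketch below (Python, with illustrative names; this is not the plugin's actual code, just an assumption about the mechanism) shows how rewriting stored links in place during iteration defeats the already-seen check. The last page's "next" link points back at an already-fetched page, so once the stored value has been clobbered, that page is re-queued until memory runs out:

```python
def crawl(start, next_link, max_pages=10, rewrite_in_place=False):
    """Model of recursive multipage fetching with seen-link dedup.

    If stored links are rewritten in place during iteration (as with
    PHP's foreach by reference), the membership test compares against
    the rewritten values, misses, and pages are re-fetched forever
    (here bounded by max_pages so the model terminates).
    """
    links = [start]     # acts like the plugin's $links array
    fetched = []
    i = 0
    while i < len(links) and len(fetched) < max_pages:
        url = links[i]
        fetched.append(url)
        nxt = next_link(url)
        if rewrite_in_place:
            links[i] = "normalized:" + url   # clobbers the stored value
        if nxt not in links:                 # the "already seen" check
            links.append(nxt)
        i += 1
    return fetched

# The last page's pagination nav points back at itself, so dedup
# must be what stops the recursion.
table = {"p1": "p2", "p2": "p3", "p3": "p3"}
print(crawl("p1", lambda u: table[u]))                          # stops after 3 pages
print(crawl("p1", lambda u: table[u], rewrite_in_place=True))   # keeps re-queuing p3
```

With the in-place rewrite the loop only stops at the artificial cap; in the plugin there is no such cap on the hot path, which would match the observed memory exhaustion after the third fetch.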

monofox commented 4 months ago

@dugite-code: I've made a proposal in #202.

Please let me know your opinion on this.

What I've noticed while testing the different scenarios is that the counter and the links do not contain the right values while iterating in recursive mode. The counter is only increased on the next call, so if a page contains 4 links, all 4 subsequent links are considered page 2. This might be an issue for properly honoring the $maxpages setting. Because the $links array is replaced, then merged and extended, seenlinks never ends up containing all the previously seen links, as it is replaced with its own. But this is not relevant for solving this issue.
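A toy model of that counter behavior (Python, illustrative names only, not the plugin's code): when the page number is incremented once per recursion level, every link discovered on one page gets the same number, so a $maxpages-style limit is checked against the wrong count. Incrementing once per fetch instead gives a counter the limit can be enforced against:

```python
def number_by_recursion(tree, node="root", page=1, out=None):
    """Increment the counter once per recursive call: all links found
    on one page share the same page number (the behavior described
    above, where 4 links are all considered page 2)."""
    if out is None:
        out = {}
    out[node] = page
    for child in tree.get(node, ()):
        number_by_recursion(tree, child, page + 1, out)
    return out

def number_by_fetch(tree, root="root"):
    """Increment the counter once per fetched page instead, so a
    max-pages limit reflects the real number of fetches."""
    order, queue, count = {}, [root], 0
    while queue:
        node = queue.pop(0)
        count += 1
        order[node] = count
        queue.extend(tree.get(node, ()))
    return order

tree = {"root": ["a", "b", "c", "d"]}   # one page linking to 4 more
print(number_by_recursion(tree))        # a, b, c, d all numbered 2
print(number_by_fetch(tree))            # a..d numbered 2, 3, 4, 5
```

Under the first scheme a limit of 3 pages never triggers even though 5 pages were fetched; the second scheme makes the limit meaningful.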