feediron / feediron-recipes

Recipes for the TT-RSS Plugin Feediron
MIT License

Trouble with XPath expression, how to get the 2nd element? #7

Closed uqs closed 2 years ago

uqs commented 4 years ago

Expected Behavior

The arstechnica.com recipe is broken and/or they changed their site layout so that the extracted element is often just half of the content.

Recipe Code


{
    "name": "arstechnica.com",
    "url": "arstechnica.com",
    "stamp": 1470889961,
    "author": "cwmke",
    "match": "arstechnica.com",
    "config": {
        "type": "xpath",
        "xpath": "div[contains(@class, 'article-content')]",
        "multipage": {
            "xpath": "nav[contains(@class, 'page-numbers')]\/span\/a[last()]",
            "append": true,
            "recursive": true
        },  
        "modify": [
            {
                "type": "regex",
                "pattern": "\/<li.*? data-src=\"(.*?)\".*?>\\s*<figure.*?>.*?(?:<figcaption.*?<div class=\"caption\">(.*?)<\\\/div>.*?<\\\/figcaption>)?\\s*<\\\/figure>\\s*<\\\/li>\/s",
                "replace": "<figure><img src=\"$1\"\/><figcaption>$2<\/figcaption><\/figure>"
            }   
        ],  
        "cleanup": [
            "aside",
            "div[contains(@class, 'sidebar')]"
        ]   
    }   
}   

Context

Ignore the modify regex, that is not the problem. I've only this example article at hand and this is not supposed to be a political statement or anything (I'm just curious what all this Impostor stuff is actually about)

https://arstechnica.com/gaming/2020/10/aocs-twitch-streaming-debut-attracts-over-435000-among-us-viewers/

Run that article through the filter, and you'll notice that the bottom half of the article is missing.

The article structure is roughly like so:

<article>
  <div> <div> <section class="article-guts"> <div class="article-content post-page"> </div> </section> </div> </div>
  <some ad stuff in here>
  <div> <div> <section class="article-guts"> rest of article in here </section> </div> </div>
</article>

The filter grabs the first article-content and runs with it. So I changed it to:

    "xpath": [
        "div[contains(@class, 'article-content')]",
        "(//section[@class='article-guts'])[1]"
    ],

Because in Chrome, I can select it in the console using $x("//section[@class='article-guts']")[1]. But in feediron, this results in all content getting dropped (and then the fallback to displaying the full HTML).

I'm confused as to how XPath works and how it works in Feediron and whether it would concatenate 2 expressions or whatever. Just running with the single filter of: "section[@class='article-guts'][last()]" results in, you guessed it, the first article-guts content getting displayed, not the 2nd or last one.
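One likely reason `[last()]` still yields the first block: in XPath, a positional predicate attached to a step is evaluated per parent element, not over the whole result set. In the layout described above, each `section` sits in its own wrapper `div`, so every one of them is the last `section` of its parent. Note also that Chrome's `$x(...)[1]` indexes a 0-based JavaScript array (the second match), whereas the XPath `(...)[1]` is 1-based (the first match). Below is a minimal Python sketch using the standard library's ElementTree; the `PAGE` markup is a made-up stand-in for the article, not the real Ars Technica HTML, and ElementTree does not support the parenthesized form, so plain list indexing stands in for it.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the article layout (assumption, not the real page).
PAGE = (
    "<article>"
    "<div><div><section class='article-guts'>"
    "<div class='article-content post-page'>first half</div>"
    "</section></div></div>"
    "<div class='ad'>ad stuff</div>"
    "<div><div><section class='article-guts'>second half</section></div></div>"
    "</article>"
)

root = ET.fromstring(PAGE)

# The positional predicate is checked against each element's own parent, so
# both sections qualify as the "last" section inside their wrapper <div>.
per_parent = root.findall(".//section[@class='article-guts'][last()]")

# Indexing the complete result list corresponds to (//section[...])[n] in
# XPath (1-based) and to $x(...)[n] in the Chrome console (0-based).
all_sections = root.findall(".//section[@class='article-guts']")
second = all_sections[1]  # Chrome: $x(...)[1]; XPath: (...)[2]

print(len(per_parent))
print("".join(second.itertext()))
```

If only the first hit of an expression is kept, `[last()]` on the step therefore changes nothing for this layout; the whole expression has to be wrapped in parentheses before indexing, or all matches have to be collected.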

Help? Does feediron extract both XPaths and concatenate them? How can I get it to extract both article-guts sections? Why does it think the forward slashes need to be escaped, and why does it rewrite them?
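On the escaped slashes: `\/` is a legal JSON escape for `/`, so both spellings decode to the same XPath string. PHP's `json_encode` escapes slashes by default unless `JSON_UNESCAPED_SLASHES` is passed, which would explain the rewriting if the config is re-serialized that way; that feediron does so is an assumption here, not something confirmed in this thread. A small Python check of the decoding side:

```python
import json

# Two spellings of the same JSON string: with and without escaped slashes.
escaped = r'"nav\/span\/a[last()]"'
plain = '"nav/span/a[last()]"'

decoded_escaped = json.loads(escaped)
decoded_plain = json.loads(plain)

# Both decode to an identical XPath string, so the escaping is cosmetic.
print(decoded_escaped == decoded_plain)
print(decoded_escaped)
```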

dugite-code commented 3 years ago

Sorry for the late reply; GitHub is really terrible at notifying me when an issue is opened.

The xpath filter only grabs a single index/occurrence; in order to join them, you need to manually select each instance.

If you want to select all the instances of an xpath, you need to use the all_xpath filter. Note: I've marked it experimental only because I don't use it personally: https://github.com/feediron/ttrss_plugin-feediron/tree/master/filters/fi_mod_all_xpath
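To illustrate the difference between taking a single occurrence and collecting every occurrence, here is a rough Python sketch. It is not feediron's code (the plugin is written in PHP); the helper names `first_match` and `all_matches` and the `PAGE` markup are hypothetical and only mirror the layout described in the question.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the article layout (assumption, not the real page).
PAGE = (
    "<article>"
    "<div><section class='article-guts'>first half</section></div>"
    "<div class='ad'>ad stuff</div>"
    "<div><section class='article-guts'>second half</section></div>"
    "</article>"
)


def first_match(root, path):
    """Serialize only the first element matching path (single occurrence)."""
    found = root.find(path)
    return "" if found is None else ET.tostring(found, encoding="unicode")


def all_matches(root, path, joiner="\n"):
    """Serialize every element matching path, joined in document order."""
    return joiner.join(
        ET.tostring(elem, encoding="unicode") for elem in root.findall(path)
    )


root = ET.fromstring(PAGE)
path = ".//section[@class='article-guts']"

print(first_match(root, path))
print(all_matches(root, path))
```

With the single-occurrence helper the second half of the article is lost, which matches the symptom in the original report; collecting all matches keeps both halves and skips the ad block between them.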