feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

Add xpath all items switch to tags filter #179

Closed benfishbus closed 9 months ago

benfishbus commented 2 years ago

I hope this is not a really stupid question, but I just encountered a site that provides some tags that ttrss is not pulling in. I want to use the tags filter to import them, but it's not clear to me where it belongs in the config. I've only used the xpath and readability filters before.

Is the tags filter even applied per domain/URL, or is it global?

I searched for example recipes and could not find any...

dugite-code commented 2 years ago

> Is the tags filter even applied per domain/URL, or is it global?

Tag fetching is a per-URL filter. See the documentation here: https://github.com/feediron/ttrss_plugin-feediron#tags-filter

A full config example is like this:

{
        "type": "xpath",
        "tidy-source": true,
        "xpath": [
            "div[contains(@class,'entry-content')]"
        ],
        "tags": {
            "type": "xpath",
            "replace-tags":true,
            "xpath": [
                "p[@class='topics']\/\/text()"
            ],
            "split":","
        }
}

Tag filtering behaves very similarly to the xpath filtering, but it also attempts to strip out any remaining HTML to get plain text. For best results, point the xpath directly at the body text where possible, as in the example above.
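As a rough illustration of that select, strip and split sequence, here is a minimal Python sketch. It is not the plugin's PHP code: the markup, the extract_tags helper and the use of Python's xml.etree.ElementTree (which supports only a subset of XPath, so the text is collected with itertext() rather than text()) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; a real page will differ.
HTML = """
<div class="entry-content">
  <p class="topics">Censorship, <a href="/t/bbc">BBC</a>, Broadcasting</p>
</div>
"""


def extract_tags(markup, path, split):
    """Select the first node matching `path`, drop nested markup, split into tags."""
    root = ET.fromstring(markup)
    node = root.find(path)  # first matching node only
    if node is None:
        return []
    # itertext() walks all descendant text, which removes the <a> wrapper.
    text = "".join(node.itertext())
    return [part.strip() for part in text.split(split) if part.strip()]


tags = extract_tags(HTML, ".//p[@class='topics']", ",")
print(tags)
```

Here the path argument plays the role of the "xpath" entry and the split argument the role of "split" in the config above.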

benfishbus commented 2 years ago

I'm still learning xpath. How would I select all of the tags in this html:

<div class="grid-ten grid-prepend-two large-grid-nine grid-last content-topics topic-list">
      <i class="icon-tag"></i>
      <ul>
          <li class="topic-list-item">
            <a href="/topics/censorship-208">Censorship</a>
          </li>
          <li class="topic-list-item">
            <a href="/topics/bbc-3648">BBC</a>
          </li>
          <li class="topic-list-item">
            <a href="/topics/broadcasting-4655">Broadcasting</a>
          </li>
          <li class="topic-list-item">
            <a href="/topics/nuclear-10888">Nuclear</a>
          </li>
          <li class="topic-list-item">
            <a href="/topics/harold-wilson-16241">Harold Wilson</a>
          </li>
          <li class="topic-list-item">
            <a href="/topics/nuclear-war-17464">Nuclear war</a>
          </li>
      </ul>
    </div>

No matter what I try, all I get is the first one.

benfishbus commented 2 years ago

"div[contains(@class, 'topic-list')]\/\/ul" gives me one giant tag that I can't split.

"li[@class='topic-list-item']" gives me only the first tag.

"\/\/a[@href[contains(.,'\/topics\/')]]" also gives me only the first tag.

I don't want to resort to regex unless I have to.

dugite-code commented 2 years ago

@benfishbus The tags filter hooks into the basic xpath filter, so it's stuck with the same limitations.

Editing this line https://github.com/feediron/ttrss_plugin-feediron/blob/master/filters/fi_mod_tags_xpath/init.php#L19 to be:

$newtag = ( new fi_mod_all_xpath() )->perform_filter( $html, $config, $settings );

This switches the tags filter over to the XPath Filter - All-items https://github.com/feediron/ttrss_plugin-feediron/tree/master/filters/fi_mod_all_xpath, which I've classed as experimental for quite a while now but which should be OK to use. You should be able to define the join_element https://github.com/feediron/ttrss_plugin-feediron/tree/master/filters/fi_mod_xpath#join_element---join_elementstr and then split the tags on that same join_element.
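To show why joining and then splitting helps, here is a small Python sketch of the idea. It only mimics the behaviour described in this thread and is not the plugin's implementation: first_item, all_items and node_text are hypothetical helper names, and the markup is trimmed from the snippet posted above.

```python
import xml.etree.ElementTree as ET

# Markup trimmed from the snippet earlier in this thread.
HTML = """
<div class="content-topics topic-list">
  <ul>
    <li class="topic-list-item"><a href="/topics/censorship-208">Censorship</a></li>
    <li class="topic-list-item"><a href="/topics/bbc-3648">BBC</a></li>
    <li class="topic-list-item"><a href="/topics/nuclear-war-17464">Nuclear war</a></li>
  </ul>
</div>
"""


def node_text(node):
    """Plain text of a node with nested markup removed."""
    return "".join(node.itertext()).strip()


def first_item(root, path):
    """Mimics the behaviour reported above: only the first match is used."""
    node = root.find(path)
    return node_text(node) if node is not None else ""


def all_items(root, path, join_element):
    """Mimics an all-items lookup: every match, glued with join_element."""
    return join_element.join(node_text(n) for n in root.findall(path))


root = ET.fromstring(HTML)
path = ".//*[@class='topic-list-item']"

single = first_item(root, path)      # one tag only
joined = all_items(root, path, ",")  # every match in one string
tags = joined.split(",")             # split on the same join_element
print(single, joined, tags)
```

The join_element only needs to be a string that never occurs inside a tag name, since the same string is used again as the split value.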

dugite-code commented 2 years ago

I might make this a switch for the tags filter. It would be a good enhancement.

benfishbus commented 2 years ago

Interesting. I had stumbled across this experimental filter in my testing, and it didn't work for me. I assumed it could be invoked within the "tags" filter as "type": "all_xpath".

After making that edit to Line 19, this now works:

    "tags": {
        "type": "xpath",
        "replace-tags": true,
        "xpath": [
            {
                "xpath": "*[@class='topic-list-item']",
                "index": "all"
            }
        ],
        "join_element": ",",
        "split": ","
    }

Thank you! 😎