feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

Some Plugin Questions #164

Closed Organizer21 closed 3 years ago

Organizer21 commented 3 years ago

Hi there, apologies if this is not the right place for questions.

I was hoping I could ask two quick (updated with a few more) questions about this plugin.

  1. It's my understanding that it integrates with readability but does not require it? Does it also support e.g. mercury_fulltext? I ask mainly because I ended up using the mercury_fulltext plugin due to readability failing to fetch the text for me on many sites.

  2. I have also been using a TT plugin called "feedcleaner" for years, though as it has not been updated for a long time I am considering feediron. It seems to me that "feedcleaner" is only able to clean the RSS feed itself, so it is useless once mercury or readability is triggered. From what I understand, feediron also cleans what readability fetches?

  3. Reading the configuration example and some external comments, I am a bit confused: can feediron visit a site and pull full text and multiple pages on its own, or does that always require the optional readability plugin?

  4. I do a lot of cleaning with feedcleaner, some 1000+ separate basic regex tweaks through its JSON (example below):

    { "URL": "adventurespiele.net", "type": "regex", "pattern": "#htm.?(.?)#is", "replacement": "htm<![CDATA[$1]]>" },
    { "URL": "canaltech.com.br", "type": "regex", "pattern": "#[^\x9\xa\x20-\xD7FF\xE000-\xFFFD]#", "replacement": "" },

From what I gather, rules like these could be rewritten to a feediron format rather easily, also using regex?

  5. I also run multiple extensive regex matches on e.g. keywords in select titles or links on thousands of sites... a match triggers the mercury_fulltext plugin (or readability plugin) through the built-in TT FILTERS options. Would I still be doing that separately from feediron, or would I update those to trigger feediron to get the full text and then also have the ability to manipulate/clean the code of select sites that trigger that way?

Sorry if not explained well enough, hopefully understandable though.

Chris

dugite-code commented 3 years ago

Question 1.

There are two versions of Readability: Readability.php, an optional install but highly recommended, and the very old but built-in readability module. The main reason for making Readability.php an optional install is to offload its maintenance to the upstream project.

When you install it, it's simply a set of downloaded PHP files, like the FeedIron plugin itself. The most used module would be the xpath module anyway.

Question 2.

Feediron doesn't work on the article stub included in the RSS feed. It reaches out to the website and grabs the page, then processes it.

If you trigger an xpath filter on the readability output, you have access to the more advanced cleanup functions, for example:

{
        "type": "readability",
        "prependimage": true,
        "appendimages": true,
        "xpath": "*",
        "cleanup": [
            "span[text()='Loading']",
            "*[contains(text(),'Edition newsletters')]",
            "h2[contains(text(),'Most Viewed')]",
            "nav"
        ]
}

Question 3.

Feediron uses xpaths for multipage processing and doesn't need readability.php to achieve this.
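
The idea of xpath-driven multipage processing can be sketched roughly as follows. This is Python for illustration only, not FeedIron's PHP code. The `PAGES` dictionary (standing in for real page fetches), the `collect_pages` helper, and the `next` link class are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical site: URL -> well-formed markup, standing in for real HTTP fetches.
PAGES = {
    "/story?page=1": "<div><p>Part one.</p><a class='next' href='/story?page=2'>Next</a></div>",
    "/story?page=2": "<div><p>Part two.</p><a class='next' href='/story?page=3'>Next</a></div>",
    "/story?page=3": "<div><p>Part three.</p></div>",
}


def collect_pages(start_url, next_xpath, max_pages=10):
    """Follow the link matched by `next_xpath` and gather every page's paragraphs."""
    texts, seen, url = [], set(), start_url
    # Stop on an unknown URL, a loop back to a visited page, or the page limit.
    while url in PAGES and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = ET.fromstring(PAGES[url])
        texts.extend(p.text for p in page.iter("p"))
        link = page.find(next_xpath)
        url = link.get("href") if link is not None else None
    return texts


article = collect_pages("/story?page=1", ".//a[@class='next']")
print(article)
```

The only site-specific input is the xpath that locates the "next page" link, which is why no readability step is needed for this part.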

Question 4.

Basic search and replace, along with regex search and replace, is possible. See: reformat / modify - "reformat":[array of options] "modify":[array of options]

Again, this doesn't require Readability.php. The main issue you will encounter is escaping the characters in your regex so that it fits the JSON config format.
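
One way to sidestep manual escaping is to build the config in a script and let a JSON library write it out. The Python sketch below converts one feedcleaner-style rule into a FeedIron-style `modify` entry. The output keys (`type`, `pattern`, `replace`) and the site-keyed layout are assumptions based on the FeedIron README and should be verified against it. The helper `to_feediron_modify`, the sample rule, and the `article` xpath are hypothetical.

```python
import json


def to_feediron_modify(rule):
    """Map one feedcleaner-style regex rule to a FeedIron-style modify entry.

    The output keys are an assumption; check them against the FeedIron README.
    """
    if rule.get("type") != "regex":
        raise ValueError("only regex rules are handled in this sketch")
    return {
        "type": "regex",
        "pattern": rule["pattern"],
        "replace": rule["replacement"],
    }


# Hypothetical feedcleaner rule; the raw string keeps the backslash literal.
feedcleaner_rule = {
    "URL": "example.com",
    "type": "regex",
    "pattern": r'#<div class="ad">.*?</div>\s*#is',
    "replacement": "",
}

site_config = {
    feedcleaner_rule["URL"]: {
        "type": "xpath",
        "xpath": "article",
        "modify": [to_feediron_modify(feedcleaner_rule)],
    }
}

# json.dumps doubles the backslashes and escapes the quotes for us.
config_text = json.dumps(site_config, indent=4)
print(config_text)
```

With 1000+ rules, looping this over the existing feedcleaner JSON would avoid hand-escaping each pattern. Whether each pattern behaves identically in FeedIron is a separate question that needs testing.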

Question 5.

Currently it's not possible to run FeedIron against the mercury_fulltext plugin output as it fully replaces the body of the article.

Organizer21 commented 3 years ago

@dugite-code Thank you for taking the time to answer me, much appreciated... letting all of this sink in before making a decision :) though I see I might have to run a few tests before making the call. Maybe the external Readability.php fixes some of the original issues I had with the built-in one, which were the reason I got started with mercury_fulltext.

On the topic of challenges: mercury_fulltext seems to have issues with this one (it pulls the text, but also text from multiple articles below it): https://www.thumbsticks.com/nintendo-switch-releases-september-21-25-2020-09242020/

Not sure whether Readability.php + feediron would have my back in such a case, or whether it would be a problem there too.

dugite-code commented 3 years ago

The problem with readability is that it strips most of the classes, which can make xpath cleanup a pain. Ideally, once I get around to finalizing the execution order changes, we'll be able to run cleanup on the source first and then apply readability.

In the meantime, a simple xpath fetch works well on the example page:

{
    "type": "xpath",
    "xpath": "div[@id='mvp-content-main']",
    "cleanup": [
        "div[contains(@class,'code-block')]",
        "hr",
        "h2[contains(text(),'More from Thumbsticks')]",
        "p[contains(text(),'Save the Thumbsticks')]"
    ]
}
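
To show what a config like this does conceptually (select one content node, then delete everything the cleanup expressions match), here is a minimal Python sketch. It is not FeedIron's code, which is PHP. `PAGE` is a hypothetical, well-formed stand-in for a fetched article, and `extract` is an illustrative helper. Python's `xml.etree.ElementTree` only understands a small XPath subset, so the `contains(...)` predicates from the config are not reproduced.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched article page.
PAGE = """
<html><body>
  <div id="main">
    <p>First paragraph.</p>
    <hr />
    <div class="code-block">advert</div>
    <p>Second paragraph.</p>
  </div>
  <div id="sidebar"><p>Unrelated teaser.</p></div>
</body></html>
"""


def extract(page, xpath, cleanup):
    """Return the node matched by `xpath` with all `cleanup` matches removed."""
    root = ET.fromstring(page.strip())
    content = root.find(xpath)
    if content is None:
        raise LookupError("content xpath matched nothing")
    for expr in cleanup:
        # ElementTree has no parent pointers, so rebuild the map on each pass.
        parents = {child: parent for parent in content.iter() for child in parent}
        for node in content.findall(expr):
            parent = parents.get(node)
            if parent is not None:
                parent.remove(node)
    return content


content = extract(PAGE, ".//div[@id='main']", [".//hr", ".//div[@class='code-block']"])
paragraphs = [p.text for p in content.iter("p")]
print(paragraphs)
```

Because the content node is selected first, the teaser in the sidebar never reaches the output. This is the same reason a targeted xpath avoids the "text from multiple articles below" problem.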