feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License

Error "Applying Character set: utf-8" for theguardian.com #157

Closed wfriesen closed 4 years ago

wfriesen commented 4 years ago

I'm seeing this error in the System Log when running for theguardian.com:

Applying Character set: utf-8
1. plugins.local/feediron/bin/fi_logger.php(33): trigger_error(Applying Character set: utf-8, 1024)
2. plugins.local/feediron/bin/fi_helper.php(59): log(1, Applying Character set: utf-8)
3. plugins.local/feediron/filters/fi_mod_xpath/init.php(8): getDOM(
<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/australia-news/2020/aug/16/coalition-must-ensure-australia-wont-be-at-end-of-queue-for-coronavirus-vaccine-labor-says">
<head>

(full log truncated)

This occurs while pulling feeds during the usual update, and also while running in the FeedIron Testing tab.

This seems to occur regardless of the config, but the simplest one I've used to cause this is:

{
    "type": "xpath",
    "xpath": [
        "article"
    ]
}

It seems to happen against any URLs appearing in the guardian feed, but one example is: https://www.theguardian.com/australia-news/2020/aug/16/coalition-must-ensure-australia-wont-be-at-end-of-queue-for-coronavirus-vaccine-labor-says

I'm running the latest versions:
tt-rss: v20.08-5497a137d
feediron: latest master (51b34463b30eecdf09d64d481d3220853405217f)

dugite-code commented 4 years ago

This is normal behavior and not a real error: the debugging output is written to the TT-RSS system log by raising an E_USER_NOTICE. "debug": false should be the default, but you can add it to your main config to be safe. When running in the testing tab, this will still always show.

Related enhancement issue #122

wfriesen commented 4 years ago

Thanks, that clears up the logging issue, but now it appears that feediron is not being applied when the updater runs.

I started from scratch with a blank database, added the plugin and my settings, then imported my feeds. After the update ran nothing was filtered by feediron. If I go to the Feed Debugger, check Force Refresh, and then Continue, I can see lines like

[07:00:42/43] hash differs, applying plugin filters:
[07:00:42/43] ... Feediron
[07:00:42/43] === 0.1244 (sec)
[07:00:42/43] plugin data: feediron,

and the relevant feed items are filtered according to the config.

Is there a reason this would work from both the testing tab and when forcing a refresh, but not during the regular feed updates?

dugite-code commented 4 years ago

Feediron always runs for all feeds, as it checks whether the article URL matches the config URL. The only reason it won't replace the feed body is if it can't match the article's link against the URL in the config. I'm using very simple text matching to do this, so keep it as simple as possible, i.e. don't use https://www.theguardian.com, use theguardian.com

Your main config should look something like this.

{
    "theguardian.com":{
        "type": "xpath",
        "xpath": [
            "article"
        ]
    },
    "debug": false
}

The only time I've encountered issues is when the site uses a service like feedburner that obfuscates the original article URL. Using the af_unburn plugin solves that, however.
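The matching behaviour described above can be illustrated with a rough Python sketch. This is not the plugin's code (FeedIron is written in PHP); it assumes the "very simple text matching" is a plain substring test of each config key against the article link. The function name find_config_for_link and the feedburner-style link are hypothetical.

```python
def find_config_for_link(link, config):
    """Return (pattern, rules) for the first config key contained in link, else None.

    Assumption: matching is a plain substring test, as a stand-in for the
    "very simple text matching" mentioned in the thread.
    """
    for pattern, rules in config.items():
        # Global options such as "debug" are not site entries, so skip them.
        if not isinstance(rules, dict):
            continue
        if pattern in link:
            return pattern, rules
    return None


config = {
    "theguardian.com": {"type": "xpath", "xpath": ["article"]},
    "debug": False,
}

article = (
    "https://www.theguardian.com/australia-news/2020/aug/16/"
    "coalition-must-ensure-australia-wont-be-at-end-of-queue-"
    "for-coronavirus-vaccine-labor-says"
)

# The short key is found inside the full article URL.
print(find_config_for_link(article, config))

# A redirect-style link (hypothetical) hides the original host, so nothing matches.
print(find_config_for_link("http://feeds.feedburner.com/~r/example/~3/abc123/", config))
```

Under this assumption, a key such as https://www.theguardian.com would miss any link that uses http:// or omits www., which is why the shorter theguardian.com is the safer choice.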

As a side note: I didn't know the feed debugger even existed... well that's embarrassing and useful

Edit: checking a feed I use Feediron for, the average processing time is around 0.3 - 0.5 (sec), whereas a feed it's not running on takes around 0.0007 (sec)

wfriesen commented 4 years ago

This ended up being due to incorrect permissions, I think. I'm running ttrss in Docker containers on AWS, and had mounted the plugin directory as a volume from the host. After updating the scripts to clone feediron in the same way as the ttrss code itself, everything is working perfectly.
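For anyone hitting a similar symptom with a mounted plugin directory, a quick readability check can help rule permissions in or out. The sketch below is only an illustration: the thread does not say which permissions were wrong, the helper find_unreadable is hypothetical, and the path is taken from the stack trace above and may differ in your install. It would need to run inside the container as the same user that runs the feed updater.

```python
import os


def find_unreadable(root):
    """Return paths under root that the current user cannot read.

    Directories need read and execute (traverse) permission; files need read.
    A missing root is reported as unreadable as well.
    """
    if not os.access(root, os.R_OK | os.X_OK):
        return [root]
    unreadable = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK | os.X_OK):
                unreadable.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                unreadable.append(path)
    return unreadable


# Assumed plugin location, relative to the tt-rss root; adjust as needed.
for path in find_unreadable("plugins.local/feediron"):
    print("not readable:", path)
```

An empty result only shows that the current user can read the files; it does not rule out ownership or write-permission problems.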