feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License
206 stars 34 forks source link

tidy borked my rules (recent pull changed behavior) #145

Closed iWantToKeepAnon closed 4 years ago

iWantToKeepAnon commented 4 years ago

Prerequisites

Expected Behavior

Rule on gocomics returns og:image in an <img/> element.

Current Behavior

Empty output. Seems like tidy drops the whole pile of HTML on the floor (i.e. it returns the empty string) but doesn't trigger an exception.

Steps to Reproduce

Make sure tidy is installed in php and function 'tidy_parse_string' is available.

Run url: https://www.gocomics.com/calvinandhobbes/2020/01/06

Against rule:

{
    "type": "xpath",
    "xpath": "meta['og:image' = @property]\/@content",
    "start_element": "<img title='gocomics' src='",
    "end_element": "' \/>"
}

Failure Information (for bugs)

I recently did a fresh pull. This seems to be the culprit : https://github.com/feediron/ttrss_plugin-feediron/commit/7e538de5a7ccffda986f8af3fa33a29b7cecbda5#diff-e9fa0d13b445ed0246ee0003db4374d6

  • Set tidy-source true by default

Why is a default behavior change a good idea? I added "tidy-source": false to all my rules and the above works again. I have a few rules and adding that to each was a bit of a pain, but I am functional again.

Seems like a bad idea to me!

Paste Configuration here (see above)

dugite-code commented 4 years ago

The reason tidy was made default was along the same lines as why firefox automatically fixes broken web pages, I was encountering too many issues with broken websites that were easily solved with a quick tidy.

I'm not going to revert the change, however what I'll do is introduce several global switches (like the debug:true) switch so you can change the default behavior

iWantToKeepAnon commented 4 years ago

I'm not so concerned about the switch as I am that tidy failed and returned a zero length string. Perhaps a small size check and if tidy fails (without throwing an exception), then keep the original html string. (?)

dugite-code commented 4 years ago

I've added a check to catch if tidy removes more than half of the html. I still need to fix the defaults before I merge to master

iWantToKeepAnon commented 4 years ago

Perfect, thanks for a quick turn around! : ))