fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

Fix tags when styled by CSS instead of using semantic HTML #1117

Closed by digicommons 1 year ago

digicommons commented 1 year ago

Some websites don't use proper semantic HTML, which can lead to losing relevant formatting when using FTR. This is most apparent with interviews where a question is formatted differently from an answer with CSS (using font-weight/font-style instead of <strong> or <em>).

Is it possible to replace, say, <p class="bold">...</p> with <strong>...</strong>? Using replace_string does not seem to work here, as the closing tag needs to be targeted as well.

HolgerAusB commented 1 year ago

That should work if the class attribute really is the first one after the <p:

find_string: <p class="bold"
replace_string: <strong

The closing tag should be handled automatically, although user fivefilters doesn't trust this. Maybe you want to give us an example URL?

Or you could just replace the attribute:

find_string: class="bold"
replace_string: style="font-weight: bold"

digicommons commented 1 year ago

Unfortunately, in the case of this Seznam Zprávy article the replacement doesn't work as desired. Instead of just targeting bold spans with the class e_ac, all subsequent <span> tags are substituted with <strong> as well.

I am using the following config:

find_string: <span class="e_ac"
replace_string: <strong

I thought about replacing the class with a style attribute as well. Clever solution, thank you! But Wallabag seems to ignore this when parsing.
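
To make the two suggestions above concrete, here is a minimal Python sketch of what a purely literal find/replace does to a snippet of markup. It is not how FTR or Wallabag is implemented: the sample HTML and the helper `literal_replace` are made up for illustration, and only the class name `e_ac` comes from this thread. How a later parsing or tidy step repairs the unbalanced result is not modelled here.

```python
# Hypothetical markup: only the class name "e_ac" comes from the thread.
html = '<p><span class="e_ac">Question?</span> <span>Answer.</span></p>'


def literal_replace(text: str, find: str, replace: str) -> str:
    """Plain literal substitution, standing in for one find_string/replace_string pair."""
    return text.replace(find, replace)


# Approach 1: rename the opening tag only. The closing </span> is untouched,
# so the markup is unbalanced until some later step repairs it.
renamed = literal_replace(html, '<span class="e_ac"', '<strong')

# Approach 2: swap the attribute and keep the tag name, so tags stay balanced.
restyled = literal_replace(html, 'class="e_ac"', 'style="font-weight: bold"')

print(renamed)
print(restyled)
```

The first result contains a `<strong>` that is closed by `</span>`, so the final output depends entirely on how the downstream parser fixes the mismatch. The second keeps the markup well-formed, but only helps if the consumer preserves `<span>` elements and their style attributes.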

HolgerAusB commented 1 year ago

AFAIK Wallabag removes/replaces ALL span tags. That is a problem on many sites. So you CAN'T use a span in find_string.

digicommons commented 1 year ago

You are right, I just confirmed this behavior. Closing, as this is an issue with Wallabag and/or php-readability.

digicommons commented 1 year ago

Adding tidy: no (docs) solved the issue in the case of the website mentioned above.

HolgerAusB commented 1 year ago

@digicommons, nice to know, thanks. I did some additional tweaks, because the main/top image was missing.