fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

Create dice.com.txt #1251

Closed digicommons closed 11 months ago

HolgerAusB commented 11 months ago

@digicommons same question here: why strip the featured image?

In this case, you did not specify a body selector, so Full-Text RSS (FTR) and wallabag each find the body themselves. Neither extracted body originally contains that image.

But FTR is somewhat special here. If the detected body does not contain any image, it tries to get one from the HTML head, in this case from <meta property="og:image"..., and puts it in front of the article. You can prevent this by adding the following line to your site config file: insert_detected_image: no
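For illustration, the directive could sit in the site config like this. This is a minimal sketch: it assumes the file is the dice.com.txt created by this pull request and that `#` starts a comment line, as in other site config files. No body selector is shown because none was specified in this case.

```
# dice.com.txt (sketch, not the full config)
# Do not prepend the og:image picture when the extracted body has no image
insert_detected_image: no
```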

digicommons commented 11 months ago

Thanks for the tip! Since insert_detected_image is not included in the FTR site patterns docs, I didn't know about this directive. Is there an overview of configs that I might've overlooked?

I didn't specify a body selector since FTR/wallabag did a good job of extracting the desired sections. For stripping the lead/featured image, please see my answer here.

HolgerAusB commented 11 months ago

> Is there an overview of configs that I might've overlooked?

Unfortunately not. I've learned a lot by reading the issues here, in wallabag/wallabag, and in the FiveFilters forum.

digicommons commented 11 months ago

Inspired by the lack of an overview, I just ran grep on the site configs and extracted what looked like config directives. Maybe having such a list could be useful?
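The grep approach described above could be approximated in Python as follows. This is only a sketch under stated assumptions, not the command that was actually run: it assumes each directive sits on its own line as `name: value` or `name(arg): value`, that lines starting with `#` are comments, and that the site configs are `*.txt` files in one directory. The function names `extract_directives` and `scan_directory` are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed line shape: "name: value" or "name(arg): value".
DIRECTIVE_RE = re.compile(r"^([a-z_]+)(?:\(.*?\))?\s*:")


def extract_directives(text):
    """Count directive names found in the text of one site config file."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        match = DIRECTIVE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_directory(root):
    """Sum directive counts over every *.txt file directly under root."""
    total = Counter()
    for path in sorted(Path(root).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        total += extract_directives(text)
    return total
```

Calling `scan_directory("ftr-site-config").most_common()` on a local checkout (the path is an assumption) would list the names that look like directives, most frequent first. Anything unexpected in that list still needs a manual check, since the pattern cannot tell a real directive from a stray `word:` line.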

ftr-site-config-directives