fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

heise: only strip first ancestor #1081

Closed jangernert closed 1 year ago

jangernert commented 1 year ago

I'm using libxml2 to process the heise.de config, and libxml2 seems to match EVERY <div> that is an ancestor of the matching <section>, which in turn strips most of the HTML document.

Does Full-Text RSS do something to prevent this or would the same issue apply?

HolgerAusB commented 1 year ago

@jangernert do you have a link to an example article? I could not find one

jangernert commented 1 year ago

@HolgerAusB for example this one: https://www.heise.de/news/Neue-Smartphones-Google-stellt-das-Pixel-8-und-das-Pixel-8-Pro-vor-9324936.html

HolgerAusB commented 1 year ago

I don't see any difference in wallabag 2.6.7 or Full-Text RSS between strip: //div/section[@data-component='TeaserList']/ancestor::div and strip: //div/section[@data-component='TeaserList']/ancestor::div[1], or even after removing this line completely.

So adding the '[1]' would not harm FTR or wallabag.

@jangernert Could you please check whether this line is still needed in your use case?

jangernert commented 1 year ago

@HolgerAusB Just tested it: libxml2 still matches all ancestor <div> elements without the limitation to [1] for me.
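
For illustration, the difference between the two strip expressions can be sketched in plain Python. This is only a rough emulation and not libxml2 or Full-Text RSS itself: the standard library's ElementTree has no ancestor axis, so the hypothetical helper strip_teasers walks a parent map by hand, and the HTML snippet is an invented, heavily simplified stand-in for a heise.de page. In XPath 1.0, ancestor::div selects every <div> ancestor, while ancestor::div[1] selects only the nearest one, because position on the reverse ancestor axis counts outward from the context node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page layout (not the real heise.de markup).
HTML = """
<html><body>
  <div id="page">
    <div id="article">
      <p>Article text</p>
      <div id="teasers">
        <section data-component="TeaserList"><p>Teaser</p></section>
      </div>
    </div>
  </div>
</body></html>
"""

def strip_teasers(html, nearest_only):
    """Emulate 'strip: .../ancestor::div' (all) or '.../ancestor::div[1]' (nearest)."""
    root = ET.fromstring(html.strip())
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = root.find(".//div/section[@data-component='TeaserList']")

    # Collect <div> ancestors, nearest first (the order of the reverse axis).
    matched = []
    while node in parents:
        node = parents[node]
        if node.tag == "div":
            matched.append(node)
    if nearest_only:
        matched = matched[:1]

    # Remove every matched <div> from its parent, as a strip rule would.
    for div in matched:
        parents[div].remove(div)

    remaining_text = " ".join("".join(root.itertext()).split())
    return [div.get("id") for div in matched], remaining_text

print(strip_teasers(HTML, nearest_only=False))  # (['teasers', 'article', 'page'], '')
print(strip_teasers(HTML, nearest_only=True))   # (['teasers'], 'Article text')
```

Without the [1], the outermost <div> is among the matches, so stripping it removes the article text as well. With [1], only the teaser wrapper is removed.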