eafer / rdrview

Firefox Reader View as a command line tool
Apache License 2.0

rdrview does not extract titles #39

Open eliobtl opened 1 week ago

eliobtl commented 1 week ago

Hi! Thanks for rdrview.

I found that on some websites it does not extract titles. For example, this article looks normal in Firefox Reader View: screenshot-24-06-25-18-52-21

but with rdrview, there are no titles, only paragraphs: screenshot-24-06-25-18-53-02

On other websites, it sometimes displays subtitles normally but not the main title.

I use rdrview built from the latest commit with gcc on Alpine Linux x86_64.

If you have any idea why this happens, I would be happy to know.

eafer commented 3 hours ago

The problem is that the page you linked uses h1 tags for its section titles. rdrview expects h1 to be used only for the main title, so those headings get removed. It seems that Firefox used to have this issue too, but it was fixed a few years ago: https://github.com/mozilla/readability/commit/11093f011f57fa528a0. So I need to port that patch to rdrview, but that is not trivial because it uses a Unicode regex.
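
To illustrate the general idea of keeping h1 section headings while still dropping one that merely repeats the main title, here is a rough sketch in C. It is not taken from rdrview or from the linked commit: the helper names (`tokenize`, `token_overlap`, `heading_duplicates_title`), the fixed buffer sizes, and the 0.75 threshold are all assumptions made for the example. It is also ASCII-only, so it sidesteps the Unicode handling that makes the real port non-trivial.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 64     /* assumption: headings and titles are short */
#define MAX_TOKEN_LEN 32  /* assumption: longer words are truncated */

/* Hypothetical helper: split ASCII text into lowercase alphanumeric tokens.
 * Any other byte (including every byte of a multi-byte UTF-8 character in
 * the default "C" locale) is treated as a separator. */
static size_t tokenize(const char *text, char tokens[][MAX_TOKEN_LEN])
{
    size_t count = 0, len = 0;

    for (;; text++) {
        unsigned char c = (unsigned char)*text;

        if (isalnum(c)) {
            if (count < MAX_TOKENS && len < MAX_TOKEN_LEN - 1)
                tokens[count][len++] = (char)tolower(c);
        } else {
            if (len > 0) {
                tokens[count][len] = '\0';
                count++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}

/* Fraction of the heading's tokens that also occur in the title (0.0 to 1.0). */
static double token_overlap(const char *heading, const char *title)
{
    char h[MAX_TOKENS][MAX_TOKEN_LEN], t[MAX_TOKENS][MAX_TOKEN_LEN];
    size_t nh = tokenize(heading, h), nt = tokenize(title, t);
    size_t shared = 0;

    if (nh == 0 || nt == 0)
        return 0.0;

    for (size_t i = 0; i < nh; i++) {
        for (size_t j = 0; j < nt; j++) {
            if (strcmp(h[i], t[j]) == 0) {
                shared++;
                break;
            }
        }
    }
    return (double)shared / (double)nh;
}

/* Hypothetical rule: drop an h1 only if it mostly repeats the title.
 * The 0.75 threshold is an arbitrary choice for this sketch. */
static bool heading_duplicates_title(const char *heading, const char *title)
{
    return token_overlap(heading, title) > 0.75;
}
```

With a check like this, an h1 such as "Installation" would be kept as a section heading, while an h1 that repeats the article title would still be removed.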