buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0

Remove distracting and unnecessary tags #122

Open rien333 opened 5 years ago

rien333 commented 5 years ago

SVGs often render far too large on most websites (see e.g. GitHub and the Mozilla docs, and the screenshot below), which is quite distracting. Moreover, they are generally non-informative parts of a website (I'm no web developer, but one common use is conveying structural rather than content-bearing information, such as scalable buttons/UI elements). If someone knows of websites that use a lot of non-distracting SVGs, then removing or altering them is obviously not a good idea.

[Screenshot: broken SVG]

Other HTML tags that don't add anything to the readability of a page are <input> and <button> tags. I therefore propose to delete them. I'm willing to create a PR, given that I've already tried this for a script in qutebrowser.

Alternatively, SVGs could be scaled to make them less distracting. Looking at the SVG documentation, I suppose one could do something with the width attribute of SVGs. Not really sure how you should infer a good width value for a given SVG and browser window size though (someone who knows something about web development might, however). For reference, I've included a screenshot of the SVG above with a changed width value. To be honest, even though the SVG looks much better, I do not see in what way it enhances readability (at best, it's not distracting, but without much purpose).

[Screenshot: smaller SVG]

rien333 commented 5 years ago

@buriy No opinion?

buriy commented 5 years ago

Well, this is not the goal of the project. I mean, you can take the HTML output and strip off any part you don't like, or set the resolution for the images. If you need to display it correctly, it may even be better to use CSS for that. The library won't be able to guess what is needed and what is not for a specific use case; that's why it shouldn't have an opinion on that.
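As a rough illustration of the "strip off any part you don't like" approach, here is a minimal sketch that post-processes an HTML string using only the standard library. The names `DROP_TAGS`, `TagStripper` and `strip_tags` are made up for this example and are not part of python-readability; the input would typically be whatever `Document(html).summary()` returned. Comments and doctype declarations are not re-emitted, which is a simplification.

```python
from html.parser import HTMLParser

# Tags to drop; this set is an assumption taken from the discussion in this issue.
DROP_TAGS = {"svg", "button", "input"}
# Void elements never get a closing tag, so they must not open a skipped subtree.
VOID_TAGS = {"input", "img", "br", "hr"}


class TagStripper(HTMLParser):
    """Re-emit HTML, skipping DROP_TAGS elements and everything inside them."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_tag = None  # name of the dropped element we are currently inside
        self.depth = 0        # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth += 1
            return
        if tag in DROP_TAGS:
            if tag not in VOID_TAGS:
                self.skip_tag, self.depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <path/>; it never changes the nesting depth.
        if not self.skip_tag and tag not in DROP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth -= 1
                if self.depth == 0:
                    self.skip_tag = None
            return
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_tag:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_tag:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_tag:
            self.out.append(f"&#{name};")


def strip_tags(html: str) -> str:
    """Return html with all DROP_TAGS elements removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the tag set lives outside the library, each caller can decide what counts as distracting for their own use case.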

rien333 commented 5 years ago

Understood, sorry for bothering. Still, I don't think SVGs enhance readability in the general case.

buriy commented 5 years ago

Sorry, maybe I haven't expressed my thoughts clearly. If, based on your experience, you think this is a typical case, we can have an option for it: (fix_svg='remove' / 'leave' / '160x120', or just strip_svg=True/False). If you make a PR, I'll pull it. But if it's a really rare case, or if there are obvious counterexamples and most users wouldn't like it defaulting to True, then maybe this isn't the proper way of solving it. However, you can still have an option "strip_svg=", defaulting to False. Alternatively, we can have: strip_images=True, strip_images=['png', 'svg', 'ico'], etc. So please do some analysis of the use cases and make a PR, and I'll accept it. At the moment you have shown one example and made no analysis of how often it happens or whether True is a good default; you have only argued that you think (meaning it's true only for you) that SVG doesn't enhance readability in the general case. I think a good library shouldn't have opinions. But sensible defaults (that you can change) make the library easier to use for end users, so it's a good idea to have them.
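To make the proposed option values concrete, here is one way a `fix_svg` string could be interpreted. This is only a sketch: no such option exists in the library at the time of this thread, and `SvgAction` and `parse_fix_svg` are hypothetical names invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class SvgAction(NamedTuple):
    """What to do with each <svg> element (hypothetical helper type)."""
    kind: str                      # "remove", "leave", or "resize"
    width: Optional[int] = None    # only set for "resize"
    height: Optional[int] = None   # only set for "resize"


def parse_fix_svg(value: str) -> SvgAction:
    """Interpret a hypothetical fix_svg value: 'remove', 'leave', or 'WIDTHxHEIGHT'."""
    if value in ("remove", "leave"):
        return SvgAction(value)
    match = re.fullmatch(r"(\d+)x(\d+)", value)
    if match:
        return SvgAction("resize", int(match.group(1)), int(match.group(2)))
    raise ValueError(f"unsupported fix_svg value: {value!r}")
```

With 'leave' as the default, existing users would see no change in output unless they opt in.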

rien333 commented 5 years ago

Fair, I suppose I could craft a more thorough PR. Your suggestions on how to implement this seem fine.

My initial proposal to remove SVGs came from googling around for typical SVG use cases. The results made it seem as if they are not that informative: icons, logos, lines and animations seem to be the most typical uses. So my argument for removing them was mostly "a priori", so to speak. I really tried to find counterexamples, but SVGs in the bodies of documents seem to be fairly rare, I guess? That's also part of the reason why I've listed only two examples.

Also note that this issue mentions <button> and <input> tags, two tags I don't really see a place/logical role for in this library.

buriy commented 5 years ago

In real projects I just use https://bleach.readthedocs.io/en/latest/ and remove all tags but [b, i, a, h1-h6, em, s, p, br] (after processing with readability). You can also specify which attributes to keep. It doesn't have an img extension filter, but I would use it to filter out button and input if I needed to.
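For reference, the allow-list approach described above might look roughly like this with `bleach.clean`. The tag list is the one from the comment with h1-h6 spelled out; keeping only `href` on links and the helper name `keep_basic_markup` are assumptions made for this sketch.

```python
import bleach  # third-party: pip install bleach

# Allow-list from the comment above, with h1-h6 written out.
ALLOWED_TAGS = ["b", "i", "a", "em", "s", "p", "br",
                "h1", "h2", "h3", "h4", "h5", "h6"]
# Attributes to keep per tag; keeping only href on links is an assumption.
ALLOWED_ATTRIBUTES = {"a": ["href"]}


def keep_basic_markup(html: str) -> str:
    """Strip every tag that is not on the allow-list."""
    # strip=True removes disallowed tags instead of escaping them;
    # text inside a removed tag (e.g. a button label) is kept.
    return bleach.clean(html, tags=ALLOWED_TAGS,
                        attributes=ALLOWED_ATTRIBUTES, strip=True)
```

Note that this removes the tags but keeps their text content, so a button's label would still appear in the output.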