ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0

Purpose of various cleaner functions such as cleaner.cleanEmTags? #66

Open bradley-curran opened 7 years ago

bradley-curran commented 7 years ago

First, thanks for making this public; it's a really useful tool. Apologies in advance if I have misunderstood the code.

A number of methods inside cleaner.coffee don't make sense to me. A good example is cleanEmTags. Which sites have an <img> under an <em>?

I noticed that a lot of the cleaning operations in cleaner.coffee were in the original commit. What documents did you use to develop the initial version?

If I get a better understanding I'd be happy to add comments to make it clearer.