Tjatse / node-readability

Scrape/crawl articles from any site automatically. Make any web page readable, whether Chinese or English.

h1 h2 h3 tags are removed #48

Open anthony-foulfoin opened 6 years ago

anthony-foulfoin commented 6 years ago

I don't know whether this is intentional, but the h* tags are removed from the parsed articles. For instance, in http://www.liberation.fr/france/2017/11/24/chomage-toujours-fluctuant-a-nouveau-a-la-hausse_1612338 all the h2 and h3 tags are removed.

To fix it, I used a custom div2p regexp: this.regexps.div2p(/<(h1|h2|h3|h4|h5|h6)/); but I was wondering whether it should be part of the defaults?
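
To make the pattern's behaviour concrete, here is a minimal sketch that exercises only the regular expression from the workaround above. The thread does not show where this.regexps.div2p(...) is called, so the sketch does not call the library at all; the helper name containsHeading and the sample fragments are hypothetical and exist only for illustration.

```javascript
// Pattern copied from the workaround: matches the start of an h1..h6 tag.
// It has no "i" flag, so it is case-sensitive as written.
const headingTag = /<(h1|h2|h3|h4|h5|h6)/;

// Hypothetical helper (not part of node-readability): reports whether an
// HTML fragment contains a heading tag that the pattern would catch.
function containsHeading(html) {
  return headingTag.test(html);
}

// Illustrative fragments, not taken from the article in the report.
const samples = [
  '<div><h2>Title</h2><p>Body</p></div>', // contains a heading
  '<div><span>No heading here</span></div>', // no heading
  '<header><hr></header>', // "<header>" and "<hr>" are not h1..h6
];

for (const html of samples) {
  console.log(containsHeading(html), html);
}
```

Because the pattern requires a digit after the "h", tags such as header and hr are not matched. Uppercase tags such as H2 would not be matched either unless the "i" flag is added.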

Tjatse commented 6 years ago

it should be, thx