ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0
2.15k stars, 221 forks

Doesn't seem to work for sites that use <div> tags instead of <p> #60

Open iannshan opened 7 years ago

iannshan commented 7 years ago

I tried this on a CNN.com article and it didn't work, apparently because CNN doesn't put the article text in paragraph tags. Any suggestions for a workaround?

ageitgey commented 7 years ago

What specific article?

iannshan commented 7 years ago

This is the one I tried: http://cnn.com/2016/11/01/politics/hillary-clinton-2016-campaign/index.html

The output was just the first paragraph of the article, which, unlike the rest of the article, actually is in a p tag.

I tried a number of other sites, including Medium, NBC News, and a few random blogs, and they all worked great. When I inspected the CNN article, though, I saw that it uses div tags for the body text and figured that could be the problem.
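
One possible workaround, offered only as a rough sketch, is to pre-process the HTML so that text-bearing "leaf" div elements become p elements before the markup is handed to unfluff. The helper name `divsToParagraphs` is hypothetical and not part of unfluff, and the regex-based rewrite assumes reasonably well-formed markup. Whether this actually improves extraction for the CNN article above has not been verified.

```javascript
// Hypothetical helper (NOT part of unfluff): rewrite "leaf" <div> elements,
// i.e. divs that contain no nested <div>, as <p> elements so that a
// paragraph-oriented extractor has a chance of picking up their text.
// Assumption: the input is reasonably well-formed HTML. A regex is a blunt
// tool for HTML; a real DOM parser would be more robust.
function divsToParagraphs(html) {
  // <div ...> body </div>, where body contains no opening or closing div tag.
  const leafDiv = /<div\b([^>]*)>((?:(?!<\/?div\b)[\s\S])*?)<\/div>/gi;

  return html.replace(leafDiv, (match, attrs, body) => {
    // Skip divs that already wrap a <p>, to avoid producing nested paragraphs.
    if (/<p\b/i.test(body)) return match;

    // Skip divs with no visible text (likely layout or ad containers).
    const text = body.replace(/<[^>]*>/g, '').trim();
    if (text.length === 0) return match;

    // Keep the original attributes so class-based heuristics still apply.
    return `<p${attrs}>${body}</p>`;
  });
}

// Usage sketch (requires `npm install unfluff`; left commented out so the
// helper above runs on its own):
//
//   const unfluff = require('unfluff');
//   const data = unfluff(divsToParagraphs(rawHtml));
//   console.log(data.text);
```

Only a single pass is made, so container divs that wrap other divs are left as they are, and only the innermost text-bearing divs are rewritten.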