ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0
2.15k stars 223 forks

Grabbing sidebar content #70

Open adamrabie opened 7 years ago

adamrabie commented 7 years ago

I noticed while parsing the URL below that sidebar content sometimes gets pulled into the article content.

http://news.forexlive.com/!/anz-on-gbp-mkts-need-something-fresh-to-trade-off-if-gbp-is-to-go-lower-in-near-term-20170117
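One way to confirm the leak, and to turn it into a regression check later, is to compare the extracted text against a few phrases copied by hand from the sidebar. The sketch below is only an illustration: `findLeakedSnippets` and `normalize` are hypothetical helpers, not part of unfluff, and the sidebar phrases have to be supplied manually. The usage comment assumes the documented `require("unfluff")` entry point and the `text` field of its result.

```javascript
// Hypothetical helpers (not part of unfluff): report which known sidebar
// snippets show up inside the extracted article text.

// Lowercase and collapse whitespace so layout differences do not hide a match.
function normalize(s) {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns the subset of sidebarSnippets found in extractedText.
function findLeakedSnippets(extractedText, sidebarSnippets) {
  const haystack = normalize(extractedText);
  return sidebarSnippets.filter((snippet) => {
    const needle = normalize(snippet);
    return needle.length > 0 && haystack.includes(needle);
  });
}

// Usage sketch (assumes `npm install unfluff` and Node 18+ for global fetch):
//   const extractor = require("unfluff");
//   const html = await (await fetch(url)).text();
//   const data = extractor(html);
//   console.log(findLeakedSnippets(data.text, ["<phrase copied from the sidebar>"]));
```

An empty result means none of the supplied sidebar phrases were found in the extracted text; any returned phrase points at sidebar content that made it into the article body.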

I'll be looking into the code myself, but if anyone more familiar with it can beat me to a fix, that would be much appreciated.

If someone is interested in optimizing the content extractor for a set of URLs I commonly parse, I'd be interested in paying a freelance rate. I'm not looking for overfitting, but hoping to improve this repo's general capacity to handle varying content schemas.