Use python-boilerpipe or python-goose to extract the main text body from articles

MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques


Use python-boilerpipe or python-goose to extract the main text body from articles #90

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

A lot of the current noise stems from text that is not part of the main article (navigation, sidebars, footers and so on) being passed through the tokenisation process. Both of these Python libraries extract the main body text of a page, which is very useful for this use case.
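To illustrate the kind of noise this would remove, here is a rough standard-library sketch. It is not how python-boilerpipe or python-goose work internally; it is a crude stand-in heuristic that keeps only `<p>` text outside obvious page-chrome tags. The names `NOISE_TAGS`, `ParagraphText`, `extract_main_text` and `tokenise` are hypothetical and exist only for this example.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than article text.
# This list is an assumption made for the sketch, not taken from either library.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphText(HTMLParser):
    """Collect text inside <p> tags, skipping anything nested in a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0       # how many noise tags we are currently inside
        self.in_paragraph = False  # True while inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.noise_depth == 0:
            self.chunks.append(data)


def extract_main_text(html):
    """Return paragraph text found outside the noise tags, joined by spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())


def tokenise(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


page = (
    "<html><body>"
    "<nav><p>Home Login</p></nav>"
    "<article><p>Recommender systems rank documents.</p></article>"
    "<footer><p>Copyright</p></footer>"
    "</body></html>"
)
tokens = tokenise(extract_main_text(page))
```

With either library in place, `extract_main_text` would be swapped for a call into that library, and the tokeniser would only ever see the article body.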