MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.
GNU General Public License v3.0

Added method to shrink down web page content to only relevant bits #202

felixschorer closed 7 years ago

felixschorer commented 7 years ago

Added a method to WebPage that shrinks its content based on its average line length. It repeatedly removes the shortest remaining line until a certain threshold is met.
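
For readers who want to picture the idea, here is a minimal sketch in Python. It is not the code from this pull request: the function name `shrink_content`, the parameter `min_avg_line_length`, and the choice to stop once the average line length reaches that value are all assumptions made for illustration, since the comment only says that "a certain threshold" is used.

```python
def shrink_content(content: str, min_avg_line_length: float) -> str:
    """Drop the shortest line repeatedly until the average line length
    reaches min_avg_line_length.

    Hypothetical helper: the threshold semantics are an assumption,
    not taken from the actual WebPage implementation.
    """
    # Ignore blank lines so they do not drag the average down.
    lines = [line for line in content.splitlines() if line.strip()]

    # Removing the shortest line never lowers the average, so the loop
    # ends either when the threshold is met or when no lines are left.
    while lines and sum(len(line) for line in lines) / len(lines) < min_avg_line_length:
        shortest = min(range(len(lines)), key=lambda i: len(lines[i]))
        del lines[shortest]

    return "\n".join(lines)


if __name__ == "__main__":
    page = (
        "Home\n"
        "About\n"
        "Johann Sebastian Bach was a German composer of the Baroque period.\n"
        "Login"
    )
    print(shrink_content(page, 40))
```

The intuition is that navigation entries and similar page chrome tend to be short lines, while body text tends to be long, so discarding the shortest lines first should leave mostly the relevant text. If the threshold is higher than the longest line, this sketch returns an empty string.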

felixschorer commented 7 years ago

All changes have been extensively tested while producing test results for #196.