bmoscon / ArticleParse

Heuristic text extraction from news sites in Python3

new classifier #6

Open bmoscon opened 10 years ago

bmoscon commented 10 years ago

Now that the new features are detected, we need a new classifier to label each section as either boilerplate or content.

bmoscon commented 9 years ago

The current classifier uses only anchor density and word count. It needs to also include sentence analysis (sentence count and average sentence length), average word length, number of upper-case words, anchor count, and section position.
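
For illustration, a rough sketch of how the features listed above could be computed for one section. This is not the ArticleParse code: `SectionFeatures`, `extract_features`, and their parameters are hypothetical names. It assumes anchor density means the fraction of words that sit inside links, "upper-case words" means fully capitalized words of two or more letters, and position is the section's index scaled to the range 0 to 1.

```python
import re
from dataclasses import dataclass

WORD_RE = re.compile(r"[A-Za-z0-9']+")


@dataclass
class SectionFeatures:
    word_count: int
    sentence_count: int
    avg_sentence_length: float  # words per sentence
    avg_word_length: float      # characters per word
    upper_case_words: int
    anchor_count: int
    anchor_density: float       # assumed: anchor words / total words
    position: float             # assumed: 0.0 = first section, 1.0 = last


def extract_features(text, anchor_texts, index, total_sections):
    """Compute features for one section.

    text          -- visible text of the section
    anchor_texts  -- list with the text of each <a> element in the section
    index         -- zero-based index of the section in the page
    total_sections-- number of sections in the page
    """
    words = WORD_RE.findall(text)
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    anchor_words = sum(len(WORD_RE.findall(a)) for a in anchor_texts)

    word_count = len(words)
    sentence_count = len(sentences)
    return SectionFeatures(
        word_count=word_count,
        sentence_count=sentence_count,
        avg_sentence_length=word_count / sentence_count if sentence_count else 0.0,
        avg_word_length=sum(len(w) for w in words) / word_count if word_count else 0.0,
        upper_case_words=sum(1 for w in words if len(w) > 1 and w.isupper()),
        anchor_count=len(anchor_texts),
        anchor_density=anchor_words / word_count if word_count else 0.0,
        position=index / (total_sections - 1) if total_sections > 1 else 0.0,
    )
```

A navigation bar would typically show a high anchor density and few sentences, while article text shows the opposite, which is why these features are candidates for the classifier.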

bmoscon commented 9 years ago

Started. Also added stop word analysis.
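
A minimal sketch of what a stop word feature might look like, assuming it is the fraction of a section's words that are common English stop words. The `STOP_WORDS` set and the `stop_word_ratio` name are illustrative only and are not taken from the repository; a real list would be much longer.

```python
import re

# Small illustrative stop word list (an assumption, not the project's list).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
})


def stop_word_ratio(text):
    """Return the fraction of words in text that are stop words (0.0 if empty)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in STOP_WORDS) / len(words)
```

Running prose tends to contain many stop words, while menus and link lists contain few, so a low ratio hints at boilerplate.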