bmoscon / ArticleParse

Heuristic text extraction from news sites in Python3

Additional Shallow Text Features Required #3

Closed bmoscon closed 10 years ago

bmoscon commented 10 years ago

For improved classification, we need to calculate and store additional information about the shallow text features.

These include: the average word length in a block, the average sentence length, the absolute position of the block in the document (main text is most likely to be near other main text, and template code near other template code), the number of uppercase words in the block, and the ratio of words to sentence delimiters.
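As a rough illustration of the features listed above, here is a minimal sketch of how they might be computed for one text block. This is not the code in ArticleParse: the names `BlockFeatures` and `shallow_features`, the word regex, and the choice of `.`, `!`, and `?` as sentence delimiters are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumption: these three characters count as sentence delimiters.
SENTENCE_DELIMITERS = ".!?"


@dataclass
class BlockFeatures:
    """Hypothetical container for the shallow text features of one block."""
    position: int               # absolute index of the block in the document
    word_count: int
    sentence_count: int
    avg_word_length: float
    avg_sentence_length: float  # measured in words per sentence
    uppercase_words: int
    words_per_delimiter: float


def shallow_features(block: str, position: int) -> BlockFeatures:
    # Treat runs of letters, digits, and apostrophes as words.
    words = re.findall(r"[A-Za-z0-9']+", block)
    # Split on runs of delimiters and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", block) if s.strip()]
    delimiters = sum(block.count(d) for d in SENTENCE_DELIMITERS)

    word_count = len(words)
    sentence_count = len(sentences)
    # Count fully uppercase words; skip single letters such as "A" or "I".
    uppercase_words = sum(1 for w in words if len(w) > 1 and w.isupper())

    # Guard every division so that empty blocks yield 0.0 instead of raising.
    avg_word_length = sum(len(w) for w in words) / word_count if word_count else 0.0
    avg_sentence_length = word_count / sentence_count if sentence_count else 0.0
    words_per_delimiter = word_count / delimiters if delimiters else 0.0

    return BlockFeatures(
        position=position,
        word_count=word_count,
        sentence_count=sentence_count,
        avg_word_length=avg_word_length,
        avg_sentence_length=avg_sentence_length,
        uppercase_words=uppercase_words,
        words_per_delimiter=words_per_delimiter,
    )
```

Calling `shallow_features(text, i)` on the `i`-th block of a page would give one feature record per block, which a classifier could then compare against the records of neighbouring blocks.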

bmoscon commented 10 years ago

Added average word length, average sentence length, sentence count, and word count.

bmoscon commented 10 years ago

Committed in multiple parts.