For improved classification, we need to calculate and store additional information about the shallow text features.
These include the average word length in a block, the average sentence length, the absolute position of the block in the document (since main text is most likely to be near other main text, and template code near other template code), the number of uppercase words in the block, and the ratio of words to sentence delimiters.
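
As a rough illustration, the features listed above could be computed per block along the following lines. This is only a sketch under stated assumptions: the names `BlockFeatures` and `block_features` are hypothetical, words are taken to be runs of word characters, sentence delimiters are taken to be `.`, `!` and `?`, an "uppercase word" is taken to mean a token written entirely in capitals, and the block position is simply the block's index in the document.

```python
import re
from dataclasses import dataclass

# Assumptions (not from the original text): what counts as a word
# and what counts as a sentence delimiter.
WORD = re.compile(r"\w+")
SENTENCE_DELIMITER = re.compile(r"[.!?]+")


@dataclass
class BlockFeatures:
    avg_word_length: float      # characters per word
    avg_sentence_length: float  # words per sentence
    position: int               # index of the block in the document
    uppercase_words: int        # tokens written entirely in capitals
    words_per_delimiter: float  # ratio of words to sentence delimiters


def block_features(text: str, position: int) -> BlockFeatures:
    """Compute shallow text features for one block of text."""
    words = WORD.findall(text)
    n_words = len(words)

    # Keep only the fragments that actually contain a word.
    sentences = [s for s in SENTENCE_DELIMITER.split(text) if WORD.search(s)]
    n_delimiters = len(SENTENCE_DELIMITER.findall(text))

    avg_word_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    avg_sentence_length = n_words / len(sentences) if sentences else 0.0
    uppercase_words = sum(1 for w in words if w.isupper())

    # Assumption: a block without delimiters is treated as having one,
    # which avoids division by zero.
    words_per_delimiter = n_words / max(n_delimiters, 1)

    return BlockFeatures(
        avg_word_length=avg_word_length,
        avg_sentence_length=avg_sentence_length,
        position=position,
        uppercase_words=uppercase_words,
        words_per_delimiter=words_per_delimiter,
    )
```

Calling `block_features("Hello world. This is NASA!", position=3)` would yield one record per block that can be stored alongside the block and later passed to the classifier. Whether the position should be normalised by the total number of blocks is left open here.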