Closed — edponce closed this 4 years ago
Extended the tokenizers to support `window`, `min_token_length`, and `converters` parameters. The `window` parameter sets the maximum number of words to combine into a single token; it is applied as a running (sliding) window, so consecutive multi-word tokens overlap.
Would it be useful to also support a fixed (non-overlapping) window, that is, a mode where each word belongs to exactly one phrase?
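A minimal sketch of the two window behaviors, assuming plain word lists (the function name `window_tokens` and the `fixed` flag are hypothetical; only the `window` parameter name comes from this PR):

```python
def window_tokens(words, window=2, fixed=False):
    """Group words into multi-word tokens of up to `window` words.

    With fixed=False, a running window with overlaps is used (the
    window advances one word at a time); with fixed=True, it advances
    by its own size, so each word appears in exactly one phrase.
    """
    stride = window if fixed else 1
    return [
        " ".join(words[i:i + window])
        for i in range(0, len(words), stride)
        if words[i:i + window]
    ]

# Running window (overlapping) vs. fixed window on the same input:
window_tokens(["new", "york", "city"], window=2)
# -> ["new york", "york city", "city"]
window_tokens(["new", "york", "city"], window=2, fixed=True)
# -> ["new york", "city"]
```

The trailing short token (here `"city"`) could instead be dropped or merged; that choice would interact with `min_token_length`.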
Updated the spaCy and NLTK tokenizers with chunking modes based on nouns only, noun phrases, and POS-based phrases.
The current tokenizers only output single-word tokens.
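The POS-based chunking idea can be illustrated without the spaCy/NLTK machinery. In the sketch below, a toy dictionary stands in for a real POS tagger, and `noun_phrase_chunks` is a hypothetical helper (not the PR's actual API): runs of determiner/adjective/noun tags that contain a noun are merged into one phrase token, everything else passes through as single-word tokens.

```python
# Toy POS lookup standing in for a real tagger (spaCy or NLTK in the PR).
POS = {
    "the": "DET", "quick": "ADJ", "brown": "ADJ", "fox": "NOUN",
    "jumps": "VERB", "over": "ADP", "lazy": "ADJ", "dog": "NOUN",
}

def noun_phrase_chunks(words, pattern=("DET", "ADJ", "NOUN")):
    """Merge maximal runs of pattern-tagged words into phrase tokens.

    A run is emitted as a single phrase only if it contains a NOUN;
    otherwise its words fall back to single-word tokens.
    """
    chunks, current = [], []

    def flush():
        if current:
            if any(POS.get(w) == "NOUN" for w in current):
                chunks.append(" ".join(current))
            else:
                chunks.extend(current)
            current.clear()

    for w in words:
        if POS.get(w) in pattern:
            current.append(w)
        else:
            flush()
            chunks.append(w)
    flush()
    return chunks

noun_phrase_chunks("the quick brown fox jumps over the lazy dog".split())
# -> ["the quick brown fox", "jumps", "over", "the lazy dog"]
```

A "nouns only" mode would keep just the NOUN-tagged words, and the real tokenizers would draw tags from the tagger's output rather than a lookup table.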