edponce / FACET

Framework for Annotation and Concept Extraction in Text

Add phrase-based tokenizers #3

Closed by edponce 4 years ago

edponce commented 4 years ago

The current tokenizers output only single-word tokens.

edponce commented 4 years ago

Extended the tokenizers to support the 'window', 'min_token_length', and 'converters' parameters. The 'window' parameter sets the maximum number of words to treat as a single token, and it is applied as a running window with overlaps.
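
To make the running-window behavior concrete, here is a minimal sketch in plain Python. It is not FACET's implementation: the function name `window_tokenize` is hypothetical, whitespace splitting stands in for a real tokenizer, and it assumes that 'converters' are callables applied to each word and that 'min_token_length' filters individual words by character count. The actual semantics in FACET may differ.

```python
from typing import Callable, List, Sequence


def window_tokenize(
    text: str,
    window: int = 1,
    min_token_length: int = 1,
    converters: Sequence[Callable[[str], str]] = (),
) -> List[str]:
    """Hypothetical sketch of a running-window tokenizer (not FACET's API)."""
    # Assumption: plain whitespace splitting stands in for a real tokenizer.
    words = text.split()

    # Assumption: each converter is applied to every word, in order.
    for convert in converters:
        words = [convert(word) for word in words]

    # Assumption: min_token_length filters single words by character count.
    words = [word for word in words if len(word) >= min_token_length]

    tokens = []
    # At every start position, emit phrases of 1..window words, so
    # consecutive phrases overlap (the "running window with overlaps").
    for start in range(len(words)):
        for size in range(1, window + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


print(window_tokenize("Acute Kidney Injury", window=2, converters=[str.lower]))
```

With `window=2`, every single word and every pair of adjacent words is emitted, which is why phrases overlap.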

Would it be useful to also support a fixed window, that is, one where only phrases are used?
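
One possible reading of a fixed window is sketched below: only phrases of exactly 'window' words are emitted, and shorter tokens are dropped. This is an interpretation of the question rather than a confirmed design; the name `fixed_window_tokenize` and the whitespace splitting are assumptions made for illustration.

```python
from typing import List


def fixed_window_tokenize(text: str, window: int) -> List[str]:
    """Hypothetical sketch: emit only phrases of exactly `window` words."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # Assumption: plain whitespace splitting stands in for a real tokenizer.
    words = text.split()

    # Slide one word at a time; texts shorter than `window` yield nothing.
    return [
        " ".join(words[start:start + window])
        for start in range(len(words) - window + 1)
    ]


print(fixed_window_tokenize("the quick brown fox", window=2))
```

A non-overlapping variant (advancing by `window` words instead of one) would be another reasonable reading of "fixed".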

edponce commented 4 years ago

Updated the spaCy and NLTK tokenizers with three chunking modes: nouns only, noun phrases, and POS-based phrases.
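
As a rough illustration of what nouns-only and POS-based chunking can look like, the sketch below works on tokens that are already tagged with Penn Treebank-style POS tags. It is library-free and does not reproduce the spaCy or NLTK code paths in FACET; the function names and the chunking rule (maximal runs of adjective and noun tags that contain at least one noun) are assumptions chosen for the example.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag), e.g. ("kidney", "NN")


def nouns_only(tagged: List[TaggedToken]) -> List[str]:
    """Hypothetical nouns-only mode: keep words whose tag starts with NN."""
    return [word for word, tag in tagged if tag.startswith("NN")]


def pos_phrases(tagged: List[TaggedToken]) -> List[str]:
    """Hypothetical POS-based mode: maximal runs of adjective (JJ*) and
    noun (NN*) tags, kept only if the run contains at least one noun."""
    chunks: List[str] = []
    current: List[str] = []
    has_noun = False

    for word, tag in tagged:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append(word)
            has_noun = has_noun or tag.startswith("NN")
        else:
            # Any other tag ends the current run.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False

    # Flush a run that reaches the end of the input.
    if has_noun:
        chunks.append(" ".join(current))
    return chunks


sentence = [
    ("the", "DT"), ("acute", "JJ"), ("kidney", "NN"), ("injury", "NN"),
    ("was", "VBD"), ("severe", "JJ"),
]
print(nouns_only(sentence))
print(pos_phrases(sentence))
```

In practice the tags would come from a tagger such as those in spaCy or NLTK; they are hard-coded here so the example runs without any model downloads.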