Expose regex token_pattern

TimSchopf / KeyphraseVectorizers

Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.

BSD 3-Clause "New" or "Revised" License

254 stars 34 forks source link

Hello!

Curious if it would be possible to expose a regex token pattern param like that in CountVectorizer? This would help in filtering for (un)wanted chars during tokenization, e.g. hyphens, ampersands, apostrophes, etc.

The workaround I have found so far has been to use a custom POS tagger (custom_pos_tagger param of KeyphraseVectorizer) wherein I don't change any POS patterns/behaviour but recompile and modify the underlying spacy tokenizer object's prefix, suffix, and infix params. Wondering if there is a simpler way of exposing such behaviour? Keen to hear your thoughts!

Thanks

TimSchopf / KeyphraseVectorizers

Expose regex token_pattern #20