Closed davedgd closed 5 years ago
It's a year late but still thank you @davedgd!
Merged via https://github.com/alvations/pywsd/commit/42bdc97b93de67c3464ba288069501e92df76376 since you no longer have the repo on your github account =)
Thank you! No worries about being late; I appreciate your having implemented this. I came up with an easier solution to fix the problem in my code, but this is even better! =)
It would be nice to be able to pass a different tokenizer to the disambiguate function for better compatibility with other tools (e.g., simply splitting on whitespace for pre-tokenized text, or substituting an alternative tokenizer such as Stanford's for NLTK's word_tokenize). This matters because tokenizers apply different internal rules and can produce a different set/number of tokens for the same input, so the token count returned by disambiguate may not line up with the counts from other tools in the pipeline.
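To illustrate the idea, here is a minimal sketch of the tokenizer-injection pattern being requested. Note that `disambiguate_sketch` is a hypothetical stand-in, not pywsd's actual API: the point is only that accepting a tokenizer callable (defaulting to whitespace splitting here) keeps the output aligned with however the text was tokenized upstream.

```python
from typing import Callable, List, Optional, Tuple

def disambiguate_sketch(
    sentence: str,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> List[Tuple[str, Optional[str]]]:
    # Hypothetical sketch: the tokenizer is injected rather than hard-coded,
    # so the number of (token, sense) pairs returned matches whatever tool
    # produced the tokens upstream.
    tokens = tokenizer(sentence)
    # ...per-token sense disambiguation would happen here; we return None
    # as a placeholder sense for each token...
    return [(tok, None) for tok in tokens]

# Whitespace splitting keeps pre-tokenized text intact ("don't" stays one
# token), whereas NLTK's word_tokenize would split it into "do" + "n't":
print(disambiguate_sketch("don't split this"))  # three (token, sense) pairs
```

An alternative tokenizer (e.g., NLTK's `word_tokenize`) could then be passed as the `tokenizer` argument without changing the rest of the pipeline.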